DETAILED ACTION 

1. A request for continued examination under 37 CFR 1.114, including the fee set 
forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this 
application is eligible for continued examination under 37 CFR 1.114, and the fee set 
forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action 
has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 22 
March 2005 has been entered. 

2. Claim 6 has been canceled. 
Claim 1 has been amended. 

3. Claims 1-5 are pending and under examination. 

4. The text of those sections of Title 35, U.S. Code not included in this action can 
be found in a prior Office action. 

5. This Office Action contains New Grounds of Rejections. 

Inventorship 

6. The request for the deletion of inventors Eaton, Filvaroff, Gerristen and 
Watanabe is approved and the inventors have been deleted. 

Rejections Withdrawn 

7. The rejection of claims 1-5 under 35 U.S.C. 103(a) as being unpatentable over 
Lai et al (WO 00/00610, 1/6/2000, cited previously) in view of Queen et al (US Patent 
5,530,101, issued 6/1996) is withdrawn in view of applicant's arguments and the fact 
that Lai et al do not teach the expression of the polypeptide or a function for the protein. 

Response to Arguments 

8. The rejection of claims 1-5 under 35 U.S.C. 101 because the claimed invention is 
not supported by a substantial asserted utility or a well-established utility is maintained. 

The response filed 3/22/2005 has been carefully considered, but is deemed not 
to be persuasive. Applicant reviews the evidentiary standard regarding the legal 
presumption of utility. The examiner takes no issue with Applicant's discussion of the 
evidentiary standard regarding the legal presumption of utility. Applicant argues that the 
utility need not be proved to a statistical certainty, a reasonable correlation between the 
evidence and the asserted utility is sufficient and applicant cites numerous case law in 
support of applicants arguments that for a therapeutic and diagnostic use, utility does 
not have to be established to an absolute certainty and the evidence need not be direct 
evidence so long as there is a reasonable correlation between the evidence and the 
asserted utility. Applicant argues that as set forth in MPEP 2107 II(B)(1) "If applicant 
has asserted that the claimed invention is useful for any particular practical purpose... 
and the assertion would be considered credible by a person of ordinary skill in the art, 
do not impose a rejection based on lack of utility." In response to these arguments, the 
examiner agrees with Applicant's statement that absolute certainty is not the legal 
standard for utility. However, the rejection does not question the presumption of truth, 
or credibility, of the asserted utility. The asserted utilities of cancer diagnostics and 
cancer therapeutics for the claimed polypeptides are credible and specific; however, 
they are not substantial. The data set forth in the specification are preliminary at best 
because the specification does not teach the expression of the PRO1069 polypeptide 
nor any particular biological activity of the polypeptide. Applicant summarizes their 
arguments and the disputed issues involved. Applicant reiterates that Example 18 in 
the specification shows that mRNA encoding the PRO1069 polypeptide is more highly 
expressed in normal kidney compared to kidney tumor and applicant asserts that it is 
well-established in the art that a change in the level of mRNA for a particular protein, 
generally leads to a corresponding change in the level of the encoded protein and 
based on the identification of the mRNA encoding the PRO1069 polypeptide under- 
expressed in tumor tissue compared to normal tissue renders the PRO1069 polypeptide 
useful as a diagnostic tool for the determination of the presence or absence of tumor. In 
support, applicant again argues with the declaration of J. Christopher Grimaldi 
(previously submitted as Exhibit 1) that there is at least a two-fold difference in 
PRO1069 mRNA between kidney tumor and normal kidney tissue. This has been fully 
considered, but is not found persuasive. First, it is important to note that the instant 
specification provides no information regarding PRO1069 polypeptide levels in tumor 
samples relative to normal samples. Only gene expression data was presented. 
Therefore, the declaration is insufficient to overcome the rejection of claims 4-9, 11-17 
based upon 35 U.S.C. 101 and 112, first paragraph, since it is limited to a discussion of 
data regarding the gene expression of the PRO1069 cDNA and not gene expression 
levels and polypeptide levels. Furthermore, the declaration does not provide data such 
that the examiner can independently draw conclusions. There is no evidentiary support 
to Dr. Grimaldi's statement that if a difference in gene expression is detected, this 
indicates that the gene and its corresponding polypeptide and antibodies against the 
polypeptide are useful for diagnostic purposes, to screen samples to differentiate 
between normal and tumor. Finally, it is noted that the literature cautions researchers 
from drawing conclusions based on small changes in transcript expression levels 
between normal and cancerous tissue. For example, Hu et al (Journal of Proteome 
Research 2:405-412, 2003, IDS reference 23 filed 3/31/2005) analyzed 2286 genes that 
showed a greater than 1-fold difference in mean expression level between breast 
cancer samples and normal samples in a microarray (p. 408, middle of right column). 
Hu et al. discovered that, for genes displaying a 5-fold change or less in tumors 
compared to normal, there was no evidence of a correlation between altered gene 
expression and a known role in the disease. However, among genes with a 10-fold or 
more change in expression level, there was a strong and significant correlation between 
expression level and a published role in the disease (see discussion section). 

Applicant argues that they have established that the accepted understanding in 
the art is that there is a direct correlation between mRNA levels and the level of 
expression of the encoded protein and applicant argues with the previously submitted 
second declaration of J. Christopher Grimaldi (previously submitted as Exhibit 2), which 
states that those who work in this field are well aware that in the vast majority of cases, 
when a gene is over-expressed ...the gene product or polypeptide will also be over- 
expressed and this same principle applies to gene under-expression. Further, applicant 
argues with the declaration of Dr. Paul Polakis (previously submitted as Exhibit 3), which 
states that, based upon his experience accumulated in more than 20 years of research, 
it is his scientific opinion that for human genes, an increased level of mRNA in a 
tumor cell relative to a normal cell typically correlates to a similar increase of the 
encoded protein in the tumor cell relative to the normal cell and that based on his 
experience although reports exist where such a correlation does not exist, such reports 
are exceptions to a commonly understood general rule that increased mRNA levels are 
predictive of corresponding increased levels of the encoded protein and applicant cites 
Alberts [a] (4th ed. 2002; Exhibit 2), Alberts [b] (3rd ed. 1994; Exhibit 1), Lewin and 
Zhigang for support that mRNA expression correlates with protein expression. The 
declarations of Dr. Grimaldi and Dr. Polakis and applicant's arguments have been fully 
considered, but are not found persuasive. Alberts [b] and Lewin actually support the 
fact that further research would have to be carried out to determine if the polypeptide 
expression levels track with the expression levels of the corresponding mRNA. Alberts 
and Lewin show that there are several levels that control gene expression both at the 
transcriptional (i.e., mRNA synthesis) and the translational (i.e., protein production) 
levels. Thus, one skilled in the art would not accept that increased mRNA levels directly 
correlate with the level of the corresponding polypeptide in view of the multitude of 
controls at the transcriptional and translational levels. With respect to applicant's 
arguments regarding the art of Zhigang et al, the art of Zhigang et al does show protein 
expression; however, the experiments were carried out to demonstrate this and as such 
Zhigang supports that one needs to actually determine the expression of the protein to 
be sure of expression. Applicant also argues that in Alberts [a] (4th ed. 2002; Exhibit 2), 
figure 6-3 on page 302 illustrates the general principle that there is a correlation 
between increased gene expression and increased protein expression. In response to 
this argument, while increased transcript levels can lead to increased polypeptide 
levels, there are other regulatory factors that also affect the rate of translation as 
evidenced by Alberts [b] (Exhibit 1) in Figure 9-72. Additionally, Meric et al (Molecular 
Cancer Therapeutics, 1:971-979, 2002, IDS reference 17, filed 3/22/2005) teaches that 
in addition to variations in mRNA sequences that increase or decrease translational 
efficiency, changes in the expression or availability of components of the translational 
machinery (i.e., over-expression of eIF4E, eIF4G, eIF-2α, eIF-4A1, etc.) as well as 
activation of translation through aberrantly activated signal transduction pathways also 
affect the rate of translation in cancerous cells. Figure 6-3 of Exhibit 2 (Alberts, 4th ed. 
2002) does not account for these other types of controls that exist in cancerous cells. 
Applicant argues that Meric et al states at page 971, left column that the fundamental 
principle of molecular therapeutics is to exploit differences in gene expression between 
cancer cells and normal cells and most efforts have concentrated on identifying 
differences in gene expression at the level of mRNA, which can be attributable to either 
DNA amplification or to differences in transcription and applicant concludes that those of 
skill in the art would not be focusing on differences in gene expression between cancer 
cells and normal cells if there were no correlation between gene expression and protein 
expression. First, the statements by Meric appear to have been taken out of context. 
Meric indicates most efforts have concentrated on gene expression at the mRNA level 
due to the advent of cDNA array technology, which facilitated this type of analysis. 
Further, Meric et al in agreement with Alberts and Lewin acknowledges that gene 
expression is quite complicated and is regulated at the level of mRNA stability, mRNA 
translation and protein stability and Meric goes on to discuss that the components of the 
translation machinery and signal pathways involved in the activation of translation 
initiation represent good targets for cancer therapy (see pages 975-976). If it is the 
accepted understanding in the art that there is a direct correlation between mRNA levels 
and the level of expression of the encoded polypeptide, there would not be a need to 
target the translational machinery, unless of course the two are regulated separately. 

Further, applicant argues that the statement of Jang et al (cited previously by the 
examiner) that "further studies are necessary to determine if changes in protein levels 
track with changes in mRNA levels for metastasis associated genes in murine tumor 
cells." does not imply that the reason additional research is needed is that the 
levels of mRNA and protein were measured and found not to correlate, rather, the 
statement simply acknowledges that Jang did not attempt to correlate mRNA and 
protein levels, and thus further research would be required to do so. In response to this 
argument, the examiner recognizes that the statement by Jang does not mean that 
mRNA and protein levels were measured and found not to correlate, the point was the 
acknowledgement by Jang that further research would be required to determine if a 
correlation between mRNA and protein levels actually exists. Again, if it were established 
that the accepted understanding in the art is that there is a direct correlation between 
mRNA levels and the level of expression of the encoded protein, Jang et al would not 
state that "further studies are necessary to determine if changes in protein levels track 
with changes in mRNA levels for metastasis associated genes in murine tumor cells." 

Applicant acknowledges the examiner's citations of Vallejo et al, Powell et al, 
and Fu et al as examples of post-transcriptional regulation of protein levels, but argues that they are not 
inconsistent with applicant's position that mRNA levels correlate, more often than not, 
with protein levels. In response to this argument, and in agreement with the art of 
Vallejo et al, Powell et al, Fu et al and Jang et al, Gygi et al (Molecular and Cellular 
Biology, 19(3):1720-1730, March 1999) states "We found that the correlation between 
mRNA and protein levels was insufficient to predict protein expression levels from 
quantitative mRNA data. Indeed, for some genes, while the mRNA levels were of the 
same value the protein levels varied by more than 20-fold. Conversely, invariant 
steady-state levels of certain proteins were observed with respective mRNA transcript 
levels that varied by as much as 30-fold." (see abstract). Also, Haynes et al (1998, 
Electrophoresis 19:1862-1871, IDS reference 10 filed 3/22/2005) studied more than 
80 proteins relatively homogeneous in half-life and expression level and found no 
strong correlation between polypeptide and transcript level. For some genes, 
equivalent mRNA levels translated into protein abundances, which varied more than 50- 
fold. Haynes et al concluded that the protein levels cannot be accurately predicted from 
the level of the corresponding mRNA transcript (p. 1863, second paragraph, and Figure 
1). In agreement with Gygi and Haynes, Hanash S. [a] (Nature Reviews, Applied 
Proteomics Collection, pp. 9-14, March 2005) recently stated "There is a need to profile 
gene expression at the level of the proteome and to correlate changes in gene-expression 
profiles with changes in proteomic profiles. The two are not always linked - 
numerous alterations occur in protein levels that are not reflected at the RNA level." 
(see page 12). Further, Hanash [a] teaches that tumors are complex biological systems 
and no single type of molecular approach fully elucidates tumor behavior, necessitating 
analysis at multiple levels encompassing genomics and proteomics (see abstract). 
Hanash et al [b] (The Pharmacogenomics Journal, 3(6):308-311, 2003) states "However 
perfected DNA microarrays and their analytical tools become for disease profiling, they 
will not eliminate a pressing need for other types of profiling technologies that go 
beyond measuring RNA levels, particularly for disease-related investigations." (see 
page 311). According to Hanash et al [b], there is a need to assay protein levels and 
activities and numerous alterations may occur in proteins that are not reflected in 
changes at the RNA level (see page 311). Clearly, contrary to applicant's arguments 
and as evidenced by the art above, it is not established in the art that the accepted 
understanding is that there is a direct correlation between mRNA levels and the level of 
expression of the encoded protein. The literature supports that RNA expression cannot 
inevitably be correlated with levels of the encoded polypeptide and one skilled in the art 
would not assume that the levels of RNA are predictive of the levels of the encoded 
polypeptide given the distinct regulation of transcription and translation as evidenced by 
Alberts, Lewin, Meric, Jang et al, Vallejo et al, Powell et al, Fu et al, Gygi et al, Haynes 
et al, Hanash S [a] and Hanash et al [b]. One skilled in the art would do further 
research to determine whether or not the PRO1069 polypeptide was under-expressed 
in kidney tumor samples. Such further research requirements make it clear that the 
asserted utility is not yet in currently available form, i.e., it is not substantial. This further 
experimentation is part of the act of invention and until it has been undertaken, 
Applicant's claimed invention is incomplete. This situation is directly analogous to that 
which was addressed in Brenner v. Manson, 148 U.S.P.Q. 689 (Sup. Ct. 1966), in which 
the court held that 

"The basic quid pro quo contemplated by the Constitution and the 
Congress for granting a patent monopoly is the benefit derived by the 
public from an invention with substantial utility", "[u]nless and until a 
process is refined and developed to this point-where specific benefit exists 
in currently available form-there is insufficient justification for permitting an 
applicant to engross what may prove to be a broad field" and "a patent is 
not a hunting license" "[i]t is not a reward for the search, but compensation 
for its successful conclusion." 

Applicant refers to three additional articles previously submitted by Applicant 
(Orntoft et al; Exhibit filed 8/16/2004, Hyman et al; Exhibit filed 8/16/2004, and Pollack 
et al; Exhibit filed 8/16/2004) as providing evidence that gene amplification generally 
correlates with levels of the encoded polypeptide. Applicant characterizes Orntoft et al 
as analyzing mRNA and protein levels for individual genes located within amplified or 
deleted chromosomal regions and finding that of the 40 proteins analyzed only one 
showed disagreement between transcript alteration and protein alteration (Orntoft, page 
42). This has been fully considered, but is not found to be persuasive. Orntoft et al appear 
to have looked at increased DNA content over large regions of chromosomes and 
compared that to mRNA and polypeptide levels from the chromosomal region. This 
approach to investigating gene copy number was termed CGH. Orntoft et al do not 
appear to look at gene amplification, mRNA levels and polypeptide levels from a single 
gene at a time. The instant specification reports data regarding amplification of 
individual genes, which may or may not be in a chromosomal region that is highly 
amplified. Orntoft et al concentrated on regions of chromosomes with strong gains of 
chromosomal material containing clusters of genes (page 40). This analysis was not 
done for PRO1069 in the instant specification. That is, it is not clear whether or not 
PRO1069 is in a gene cluster in a region of a chromosome that is highly amplified. 
Therefore, the relevance of Orntoft et al is not clear. Hyman et al used the same CGH 
approach in their research. Less than half (44%) of highly amplified genes showed 
mRNA over-expression (abstract). Polypeptide levels were not investigated. Therefore, 
Hyman et al also do not support utility of the claimed polypeptides. Pollack et al also 
used CGH technology, concentrating on large chromosome regions showing high 
amplification (page 12965). Pollack et al did not investigate polypeptide levels. 
Therefore, Pollack et al also do not support the asserted utility of the claimed invention. 
Importantly, none of the three papers reported that the research was relevant to 
identifying probes that can be used as cancer diagnostics. The three papers state that 
the research was relevant to the development of potential cancer therapeutics, but also 
clearly imply that much further research was needed before such therapeutics were in 
readily available form. Accordingly, the specification's assertions that the claimed 
PRO1069 polypeptides have utility in the fields of cancer diagnostics and cancer 
therapeutics are not substantial. 

For these reasons the rejection is maintained. 

9. The rejection of claims 1-5 under 35 U.S.C. 112, first paragraph, is maintained. 
Specifically, since the claimed invention is not supported by a substantial utility or a 
well-established utility for the reasons set forth above, one skilled in the art clearly 
would not know how to use the claimed invention. 

10. The rejection of claims 1-5 under 35 U.S.C. 112, first paragraph, because the 
claims contain subject matter, which was not described in the specification in such a 
way as to enable one skilled in the art to which it pertains, or with which it is most nearly 
connected, to make and/or use the invention is maintained. 

The response filed 3/22/2005 has been carefully considered, but is deemed not 
to be persuasive. The response argues that in general differential expression levels of 
mRNA leads to differential protein expression levels and this is the general 
understanding in the art and the references cited by the examiner are exceptions to the 
general rule. Applicant relies on Example 18 of the specification, the art of Zhigang and 
Meric for support and states that the totality of the evidence clearly establishes that 
those of skill in the art would believe that mRNA levels more likely than not correlate 
with protein levels. In response to this argument, and as discussed above in the utility 
rejection, the art of Alberts, Lewin, Meric, Jang et al, Vallejo et al, Powell et al, Fu et al, 
Gygi et al, Haynes et al, Hanash S [a] and Hanash et al [b] underscores the 
unpredictability in the art, and the predictability of protein translation and its possible use 
as a diagnostic are not necessarily contingent on the levels of mRNA expression due to 
the multitude of homeostatic factors affecting transcription and translation. In view of 
the totality of evidence of record, one of skill in the art could not predictably use the 
antibodies of the present claims as a diagnostic or therapeutic agent with a reasonable 
expectation of success. 

New Grounds of Rejections 
Priority 

Applicant claims priority to five previous applications in the preliminary 
amendment of 09 September 2002. Priority is granted to PCT/US00/23328, filed 24 
August 2000, as the disclosure of '328 is identical to the instant disclosure. However, 
priority is not granted to USSN 09/380,137, PCT/US99/12252 and 60/088,740 since 
these applications do not disclose the microarray assay upon which applicant relies for 
utility of the instantly claimed polypeptides. Therefore, the filing date for the purpose of 
art rejections is deemed to be 24 August 2000. Applicant is reminded that benefit to a 
prior-filed application requires written description and enablement under the first 
paragraph of 35 U.S.C. 112. 

Claim Rejections - 35 USC § 102 

11. The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that 
form the basis for the rejections under this section made in this Office action: 

A person shall be entitled to a patent unless - 

(a) the invention was known or used by others in this country, or patented or described in a printed 
publication in this or a foreign country, before the invention thereof by the applicant for a patent. 

(e) the invention was described in (1) an application for patent, published under section 122(b), by another 
filed in the United States before the invention by the applicant for patent or (2) a patent granted on an 
application for patent by another filed in the United States before the invention by the applicant for patent, 
except that an international application filed under the treaty defined in section 351(a) shall have the effects 
for purposes of this subsection of an application filed in the United States only if the international application 
designated the United States and was published under Article 21(2) of such treaty in the English language. 

12. Claims 1-2 and 4-5 are rejected under 35 U.S.C. 102(a) as being anticipated by 
Lai et al (WO 00/00610, 1/6/2000, cited previously on PTO-892 mailed 4/15/2004). 

The claims are drawn to an antibody that specifically binds to the polypeptide of 
SEQ ID NO:50, wherein the antibody is a monoclonal antibody, an antibody fragment 
and is labeled. 

Lai et al teach a polypeptide (SEQ ID NO:35), which is identical to the instantly 
claimed polypeptide of SEQ ID NO:50, and antibodies that bind the polypeptide, including 
monoclonal antibodies, antibody fragments and labeled antibodies (see pages 44-45 and 52-53). 

13. Claims 1-2 and 4-5 are rejected under 35 U.S.C. 102(e) as being anticipated by 
Walker et al (U.S. Patent 6,277,574 B1, 4/9/1999). 

The claims have been described supra. 

Walker et al teach a polypeptide (SEQ ID NO:11) that is identical to the 
polypeptide of SEQ ID NO:50 (see the alignment attached to the back of this Office 
Action; Exhibit A) and Walker teaches monoclonal antibodies and antibody fragments 
that specifically bind the polypeptide and the antibodies may be labeled with a 
therapeutic agent for treating disease in a subject (see column 13). 

14. Claims 1-5 are rejected under 35 U.S.C. 103(a) as being unpatentable over 
Walker et al (U.S. Patent 6,277,574 B1, 4/9/1999) in view of Queen et al (U.S. Patent 
5,530,101, issued 6/96, cited previously on PTO-892 mailed 4/15/2004). 

The claims have been described supra. Claim 3 recites wherein the antibody is a 
humanized antibody. 

Walker et al have been described supra. Walker et al do not teach a 
humanized antibody. This deficiency is made up for in the teachings of Queen et al. 

Queen et al teach humanized antibodies for human therapy (see entire 
document). 

It would have been prima facie obvious to one of ordinary skill in the art at the 
time the claimed invention was made to have produced a humanized antibody to the 
polypeptide of Walker et al in view of Queen et al. 

One of ordinary skill in the art would have been motivated, and would have had a 
reasonable expectation of success, to produce a humanized antibody to the 
polypeptide of Walker et al in view of Queen et al because Walker et al teach that the 
polypeptide of SEQ ID NO:50 (i.e., SEQ ID NO:11 of Walker et al) is associated with 
kidney disease, and it would be obvious in view of Queen et al, who teach humanized 
antibodies, to humanize the antibody of Walker et al for human therapy. 

Therefore, the invention as a whole was prima facie obvious to one of ordinary 
skill in the art at the time the invention was made, as evidenced by the references. 

Conclusions 

15. No claim is allowed. 

16. Any inquiry concerning this communication or earlier communications from the 
examiner should be directed to David J. Blanchard whose telephone number is (571) 
272-0827. The examiner can normally be reached Monday through Friday from 8:00 
AM to 6:00 PM, with alternate Fridays off. If attempts to reach the examiner by 
telephone are unsuccessful, the examiner's supervisor, Jeffrey Siew, can be reached at 
(571) 272-0787. The official fax number for the organization where this application or 
proceeding is assigned is 571-273-8300. Any inquiry of a general nature, matching or 
filed papers or relating to the status of this application or proceeding should be directed 
to Kim Downing for Art Unit 1642 whose telephone number is 571-272-0521. 

Information regarding the status of an application may be obtained from the 
Patent Application Information Retrieval (PAIR) system. Status information for 
published applications may be obtained from either Private PAIR or Public PAIR. 

Integrated global profiling of cancer 

Samir Hanash 

Tumours are complex biological systems. 
No single type of molecular approach fully 
elucidates tumour behaviour, necessitating 
analysis at multiple levels encompassing 
genomics and proteomics. Integrated data 
sets are required to fully determine the 
contributions of genome alterations, host 
factors and environmental exposures to 
tumour growth and progression, as well as 
the consequences of interactions between 
malignant or premalignant cells and their 
microenvironment. The sheer amount and 
heterogeneous nature of data that need to 
be collected and integrated are daunting, 
but effort has already begun to address 
these obstacles. 

First published in Nature Reviews Cancer 4, 638-644 (2004) 
doi:10.1038/nrc1409 

In the 1980s, at the dawn of the era of 
molecular medicine, researchers believed that 
cancer was caused by dysregulation of a few 
oncogenes or tumour-suppressor genes. The 
identification of these genes would therefore 
lead to effective approaches for preventing or 
treating cancer. Substantial progress has 
been made in uncovering cancer genes that 
are altered through point mutations, 
deletions, amplifications, rearrangements or 
other events, and as a result effective targeted 
therapies for certain cancers have been 
developed. It has become clear, however, that 
human tumours are more complex and 
heterogeneous than expected, and are caused by 
defects in numerous pathways and factors 
that operate at many levels. For example, a 
gene can be amplified 100-fold in certain 
tumours with no demonstrable effect 
on RNA levels for that gene. Alternatively, 
protein levels can be increased, decreased or 
modified with no demonstrable changes in 
the levels of their corresponding RNAs. It is 
therefore a challenge to fully understand 
tumour behaviour, based on a single type of 
analysis. The factors that determine the 
consequences of a particular event or alteration 
can be highly context dependent, and are 
governed by the spatial and temporal 
activity of numerous interacting components. 
The intricate nature of the contributions of 
many factors ultimately determines the 
impact that a particular alteration has on the 
properties of a tumour or a precursor lesion. 

There are two basic approaches to address 
the complexity of cancer. One is to reduce 
complexity through analysis of experimental 
models, such as cell lines or animal models, to 
characterize the fundamental processes of 
tumour growth and to elucidate the effects 
of single genes. Another is to integrate large 
data sets, to yield a model for tumour 
development and behaviour. Each approach 
has its own advantages and disadvantages. 
The first approach has been effective in 
many respects; for example, the early stages of 
tumorigenesis have been investigated using 
mouse models, and transformation and 
metastasis have been modelled in Drosophila¹. 
However, in studying animal models of cancer, 
many factors that are relevant to human 
cancer are lost. The conclusions reached from 
these models are therefore not always 
applicable to human tumours². The second 
approach, involving integration of large data 
sets, is challenging in part because only a 
limited number of samples, such as tumours or 
preneoplastic tissues, can be analysed in a 
given study. This makes data interpretation 
and model development difficult, given the 
large amount of heterogeneity between 
human tumours. 

Profiling strategies 

Improving our understanding of cancer and 
developing theoretical models will require an 
increased understanding of the contribu- 
tions of and interactions between the 
numerous components that contribute to 
tumour formation and progression (FIG. 1). 
Strategies are available to profile changes at 
various levels, including the genome, tran- 
scriptome and proteome (TABLE 1). The host 
genome can be scanned for inherited varia- 
tions such as mutations and polymorphisms 
that might contribute to cancer risk. Tumour 
cells and their precursors can be assayed for 
genomic alterations, such as chromosomal 
deletions or amplification, or changes in 
DNA methylation status, that promote their 
proliferation and survival. The cancer-cell 
transcriptome can be examined for patterns 
of gene expression, or its proteome analysed 
to uncover alterations in proteins, that con- 
tribute to tumour development or progres- 
sion and would not be predicted by genome 
or transcriptome analysis. 

A challenge for global profiling is the 
need to capture all the elements of the indi- 
vidual compartments that are profiled, such 
as the whole transcriptome or the whole pro- 
teome. Although this is possible for the tran- 
scriptome, other compartments, such as the 
proteome and metabolome, have numerous 
features that are difficult to capture, requir- 
ing several different profiling approaches 
(TABLE 1). For example, it is not possible to 
assay for protein functional activity, profile 
protein-protein interactions, and assess pro- 
tein modifications all with the same plat- 
form. In all, there remains a substantial need 
to improve the breadth, sensitivity and 
throughput of global-profiling technologies. 

In addition to global profiling of DNA, 
RNA or protein in normal, premalignant 
and malignant tissues, and in biological 
fluids, a comprehensive analysis would 
Figure 1 | Numerous components must be integrated to study the molecular basis of human 
cancer. Several host factors contribute to tumorigenesis in humans, including diet, environmental factors, 
polymorphisms and mutations in susceptibility genes, age and immunity. Cells undergo genomic changes 
(DNA mutations and repair, methylation, amplification, deletions and rearrangements), leading to 
tumorigenesis. Tumour development also depends on factors in the microenvironment — some of these 
are produced locally, whereas others are produced systemically (growth factors, infiltrating cells and 
cytokines). Reciprocal interactions between the premalignant and malignant cells, stromal cells, 
extracellular-matrix components, various inflammatory cells and a range of soluble mediators therefore 
contribute to tumour development and progression. Once tumour samples are obtained, genomic, 
transcriptomic and proteomic tools can be used to profile specific compartments. 



measure other characteristics from these 
samples to detect changes in nutritional, 
metabolic and immune status, as well as to 
detect environmental exposures. These 
types of data come from metabolic and 
nutritional profiles, immunohistochemical 
assays, assays of host immunity to tumour 
antigens, and patient questionnaires. Such 
data need to be integrated with molecular 
profile data. 



Integrating data sets 

So far, very few cancer studies have attempted 
to integrate data sets that were obtained by 
several different profiling techniques. Rather, 
the few large-scale integrated molecular-pro- 
filing efforts undertaken have combined data 
of a similar nature, notably combining tran- 
scriptome data obtained from several sources. 
Some studies have combined data obtained 
through two different global-profiling 



platforms (genomic and transcriptomic, or 
transcriptomic and proteomic) for the 
same set of study samples (such as lung 
tumours). These integrated data sets have 
also included variables such as clinical and 
pathological characteristics of the study 
individuals and their tumours, or muta- 
tions in cancer genes such as TP53 and 
RAS. However limited in scope, these stud- 
ies illustrate the potential impact of inte- 
grating data across numerous data sets in 
elucidating certain features of cancer 3-8 . 

Integrating gene-expression data from different 
sources. Profiling gene expression using DNA 
arrays has had a tremendous impact on bio- 
medical research. Although the field is still in its 
infancy, there is increasing emphasis on inte- 
gration of diverse sets of data. From a cancer 
research point of view, applications of global 
profiling of gene expression include uncover- 
ing unsuspected associations between genes, or 
identifying specific clinical features of cancer 
that result in novel molecular-based disease 
classifications. For example, DNA microarray 
analysis has been used to associate specific 
gene-expression profiles with different clinical 
outcomes of patients with the same types of 
tumours (responders versus non-responders 9 ), 
or with cancer subtypes of the same lineage 
(high-stage versus low-stage tumours). Specific 
gene-expression signatures have also been 
associated with tumours of different lineages 10 . 

Lamb et al. performed a study that illus- 
trates the merits of integrating gene-expression 
data from several sources to develop a mecha- 
nistic understanding. They integrated gene- 
expression data from cell lines and human 
tumours to uncover a cyclin-dependent kinase 
(CDK)-independent mechanism of cyclin D1 
function. Cyclin D1, which activates CDK, is 
frequently overexpressed in human tumours, 



Table 1 | Profiling strategies for genome-related components 

Platform: Genome 
What we can learn: The hereditary components to cancer, as well as genome alterations in somatic cells that lead to cancer 
What is detected: Chromosome structural changes; gene copy-number changes; gene rearrangements; mutations/polymorphisms; methylation changes 
Tools used for analysis: DNA sequencing; cytogenetics; CGH; array CGH; SNP analysis; RLGS 

Platform: Transcriptome 
What we can learn: Changes in gene expression that are associated with cancer 
What is detected: Changes in RNA abundance; alterations in alternative splicing 
Tools used for analysis: Differential-display analysis; SAGE; DNA microarray analysis; PCR- and non-PCR-based gene-expression assays 

Platform: Proteome 
What we can learn: How proteins are modified or how their levels change in tumours 
What is detected: Protein levels; post-translational modifications; localization; protein-protein interactions; enzymatic activity 
Tools used for analysis: Sample-enrichment strategies (fractionation, protein tagging); separation-based profiling (2D gels, MS, LC, LC-MS); non-separation-based strategies (protein microarrays, direct MS analysis); protein-detection strategies (immunohistochemistry, immunofluorescence) 

2D, two dimensional; CGH, comparative genomic hybridization; LC, liquid chromatography; MS, mass spectrometry; PCR, polymerase chain reaction; RLGS, restriction 
landmark genome scanning; SAGE, serial analysis of gene expression; SNP, single nucleotide polymorphism. 

Figure 2 | Gene-expression profile of neoplastic transformation. Public sharing of gene-expression data has led to the identification of 67 genes 
that are commonly overexpressed in tumour samples, relative to normal tissue. This 'meta-signature' analysis compared 'cancer versus normal' gene-expression 
signatures from 21 independent microarray data sets. Thirteen [...] which no changes in expression were observed between tumour and normal cells. 
Light and dark red boxes signify genes that were overexpressed in tumour cells, relative to normal tissue. Dark red indicates that the expression level 
was in the 90th percentile of all samples tested. Figure reproduced with permission from ref. 11 © (2004) National Academy of Sciences. 
Tumour types shown: lung adenocarcinoma; small-cell lung cancer; hepatocellular carcinoma; ovarian carcinoma; colon carcinoma; prostate carcinoma; 
breast ductal carcinoma; salivary carcinoma; glioma; medulloblastoma; pancreatic carcinoma; bladder carcinoma; diffuse large-B-cell lymphoma. 



but the mechanism by which this promotes 
tumorigenesis has been unclear. Cyclin D1 and 
a cyclin-D1 mutant that was incapable of acti- 
vating CDK4 were each ectopically expressed 
in cultured human mammary epithelial cells. 
Twenty-one genes were found to be induced by 
both wild-type and mutant cyclin D1, indicat- 
ing that these genes are CDK4 independent. 
Furthermore, the rapidity with which expres- 
sion of these genes was induced indicated the 
direct involvement of a transcription factor. A 
database of gene-expression profiles from 190 
primary human tumours was therefore also 
analysed, to identify cyclin-D1 target genes. 
The expression pattern of the set of 21 genes 
uncovered from in vitro studies was correlated 
with the levels of cyclin D1 in human tumours. 
A 'data-mining' process was applied to several 
human tumour gene-expression data sets, to 
identify genes that had a pattern of expression 
that matched the patterns of the genes that 
comprised the cyclin-D1 signature pattern. 
The transcription factor C/EBPβ was consis- 
tently co-expressed with the set of cyclin-D1 
target genes. Functional analyses confirmed 
the involvement of C/EBPβ in the transcrip- 
tional regulation of cyclin D1. This study 
illustrates the types of findings that can be 
uncovered by integrating different sets of data. 
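
The signature-matching step described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the procedure used in the cited study: the gene names and expression values are hypothetical, and Pearson correlation against the mean profile of the signature genes is assumed as the measure of a matching pattern.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def rank_by_signature(expression, signature_genes):
    """Rank non-signature genes by correlation with the mean signature profile.

    expression: dict mapping gene name -> list of values, one per tumour.
    signature_genes: set of gene names that make up the signature.
    """
    n_samples = len(next(iter(expression.values())))
    # Average the signature genes sample by sample.
    mean_profile = [
        sum(expression[g][i] for g in signature_genes) / len(signature_genes)
        for i in range(n_samples)
    ]
    scores = {
        gene: pearson(values, mean_profile)
        for gene, values in expression.items()
        if gene not in signature_genes
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical toy data: four tumours, two signature genes, two candidates.
toy = {
    "SIG_A": [1.0, 2.0, 3.0, 4.0],
    "SIG_B": [2.0, 3.0, 5.0, 6.0],
    "CAND_X": [0.5, 1.0, 1.6, 2.0],  # rises with the signature
    "CAND_Y": [4.0, 3.0, 2.0, 1.0],  # falls as the signature rises
}
ranking = rank_by_signature(toy, {"SIG_A", "SIG_B"})
```

A gene at the top of such a ranking is only a candidate co-regulated gene; as in the study described, functional analysis would still be needed to confirm any regulatory role.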

Tumour gene-expression patterns are 
modulated by many extrinsic factors and 
by the microenvironment — these features 
could be crucial factors in determining 
the response to anticancer drugs. The gene- 
expression profiles of in vitro cultures of 



cancer cells have been compared with those of 
tumours grown in vivo, to determine the 
effects of the microenvironment on gene 
expression. In one study 4 , two human cancer 
cell lines (a lung adenocarcinoma and a 
glioblastoma cell line) were transplanted into 
immunodeficient mice and allowed to form 
tumours, and the gene-expression profiles of 
these tumours were compared with those of 
cells grown in culture. A bioinformatics 
approach was used to associate genes into 
functional classes. The classes of genes that 
were expressed at higher levels in cells grown 
in vitro were associated with increased cell 
division and metabolism, reflecting the more 
favourable environment for cell proliferation. 
By contrast, in vivo tumour growth resulted 
in upregulation of a significant number of 
genes involved in extracellular-matrix forma- 
tion, cell adhesion, cytokine and metallopro- 
teinase activity, and neovascularization. 
When placed in comparable in vivo tissue 
environments, the lung cancer and the 
glioblastoma cells expressed different sets of 
extracellular-matrix- and cell-adhesion- 
related genes, indicating different mecha- 
nisms of extracellular interaction at work in 
the different tumour types. Importantly, gene 
products that are typically targeted by cancer 
therapies, such as tyrosine kinases, showed 
varied expression patterns when the same 
cancer cells were grown in vitro versus in vivo. 
This provides an indication of why therapeu- 
tics that are effective in in vitro studies might 
not always function in vivo. 



A study that illustrates the merits of data 
sharing among investigators is a meta-analysis 
of cancer microarray data 11 . In this study, 
40 published cancer microarray data sets 
comprising gene-expression measurements 
from over 3,700 tumour samples were col- 
lected and analysed. A common transcrip- 
tional profile that is activated in most cancer 
types, relative to corresponding normal tis- 
sues, was delineated from some of the data 
sets, providing a meta-signature of neoplastic 
transformation (FIG. 2). 
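
One simple way to picture a meta-signature is as vote counting across data sets. The sketch below is a hedged simplification: the statistical procedure of the cited meta-analysis is not described here, so a plain count of how many data sets report a gene as overexpressed is substituted, and the study names, gene names and threshold are hypothetical.

```python
from collections import Counter

def meta_signature(overexpressed_by_dataset, min_datasets):
    """Return genes reported as overexpressed in at least min_datasets data sets.

    overexpressed_by_dataset: dict mapping data-set name -> iterable of gene names.
    """
    counts = Counter()
    for genes in overexpressed_by_dataset.values():
        counts.update(set(genes))  # count each gene once per data set
    return sorted(g for g, c in counts.items() if c >= min_datasets)

# Hypothetical 'cancer versus normal' signatures from three studies.
datasets = {
    "study_1": {"GENE_A", "GENE_B", "GENE_C"},
    "study_2": {"GENE_A", "GENE_C"},
    "study_3": {"GENE_A", "GENE_D"},
}
shared = meta_signature(datasets, min_datasets=2)
```

A real analysis would also need to assess whether the observed overlap exceeds what chance alone would produce, which simple counting does not address.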

Integrating genomic and transcriptomic data. 
Most tumours show numerous genomic alter- 
ations, but it has been a challenge to identify 
those that are required for different stages of 
tumour development. As most genome alter- 
ations — chromosomal gains and losses, dele- 
tions, amplification and methylation — affect 
the transcriptome, it would be useful to inte- 
grate genome profiling with transcriptome 
profiling. Several approaches are now available 
to scan the genome for gains and losses. 
These include fluorescence in situ hybridiza- 
tion, comparative genomic hybridization, 
hybridization of genomic DNA to various 
types of DNA microarrays, and restriction 
landmark genome scanning 5-8 . Additionally, 
oligonucleotide arrays are now available that 
can be used to detect single-nucleotide poly- 
morphisms and that allow genome-wide loss- 
of-heterozygosity maps to be developed from 
tumours, including samples isolated by 
laser-capture microdissection 12 . 
Pollack et al. profiled DNA copy-number 
alterations across 6,691 mapped human genes 
in 44 samples of predominantly advanced, 
primary breast tumours and 10 breast cancer 
cell lines 13 . Parallel DNA microarray-based 
measurements of mRNA levels allowed assess- 
ment of the extent to which variation in gene 
copy number contributes to variation in gene 
expression in tumour cells. 62% of highly 
amplified genes showed increased expression 
levels. Additionally, DNA copy number corre- 
lated with gene expression across a range of 
DNA copy-number alterations, including 
deletions. On average, a twofold change in 
DNA copy number was associated with a cor- 
responding 1.5-fold change in mRNA levels. It 
was estimated that overall, at least 12% of all 
the variation in gene expression among the 
breast tumours analysed was attributable 
to underlying variation in gene copy number, 
the remainder presumably attributable to a 
multitude of other factors. 
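
The two figures quoted above can be related by simple arithmetic. On a log2 scale, a twofold change in copy number paired with a 1.5-fold change in mRNA corresponds to a slope of log2(1.5)/log2(2), or about 0.585, and the fraction of expression variance attributable to copy number corresponds to the squared correlation between the two measurements. The sketch below assumes a straight-line fit on log2 ratios and uses hypothetical, noise-free values; it does not reproduce the cited study's analysis.

```python
from math import log2

def fit_line(x, y):
    """Least-squares slope and squared correlation (r^2) of y against x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx, sxy ** 2 / (sxx * syy)

# Slope implied by "twofold copy number, 1.5-fold mRNA" on a log2 scale.
implied_slope = log2(1.5) / log2(2.0)

# Hypothetical log2 copy-number ratios (a deletion, no change, two gains)
# with mRNA ratios that follow the implied slope exactly.
copy_log2 = [-1.0, 0.0, 1.0, 2.0]
mrna_log2 = [implied_slope * c for c in copy_log2]
slope, r_squared = fit_line(copy_log2, mrna_log2)
```

Because the toy values contain no noise, r_squared here is essentially 1; in tumour data, where many other factors influence expression, it would be far lower, consistent with the estimate of at least 12% given above.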

In another study 14 , restriction landmark 
genomic scanning was used to detect amplified 
genomic DNA fragments in 47 primary ovar- 
ian tumours. This approach uncovered ampli- 
fication of the LMYC oncogene in several 
tumours. Transcriptome profiling of these 
tumours using oligonucleotide microarrays 
demonstrated frequent overexpression of 
LMYC in tumour cells, compared with cells of 
the normal ovarian surface epithelium — even 
in tumours without genomic amplification of 
LMYC — indicating that tumours use different 
mechanisms to upregulate LMYC expression. 
This finding prompted an assessment of the 
expression status of various members of 
the MYC gene family in ovarian tumours. 
Interestingly, a pattern was uncovered in which 
deregulated expression of one of the members 
of the MYC gene family was observed in most 
of the tumours. 

Integrating transcriptome and proteome 
profiling. There is a need to profile gene 
expression at the level of the proteome and to 
correlate changes in gene-expression profiles 
with changes in proteomic profiles. The two 
are not always linked — numerous alterations 
occur in protein levels that are not reflected at 
the RNA level 15 . Translational control is an 
important cellular process that is regulated by 
several genes with tumour-suppressor or 
oncogenic properties 16 . For example, the pro- 
teins encoded by the tumour-suppressor 
genes tuberous sclerosis 1 (TSC1) and TSC2 
form a functional complex that inhibits the 
phosphorylation of S6 kinase and 4EBP1 — 
two key regulators of mRNA translation. 
TSC2 functions as a key regulator of the TOR 
pathway, which regulates protein synthesis, 




Figure 3 | Path from data collection and integration to hypothesis testing. Data produced by one 
research group can be combined with data in public databases such as the Cancer Genome Anatomy 
Project (CGAP) and further processed through resources available through various web sites - for example, 
the National Center for Biotechnology Information (NCBI), National Cancer Institute Center for Bioinformatics 
(NCICB), CaCore and Gene Ontology (GO) web sites - to yield integrated data sets (for further information on 
these web sites, see the online links box). This type of 'data mining' using statistical and informatics tools can 
lead to models for tumour behaviours such as metastasis, recurrence or response to therapy. Models can 
then be tested experimentally and/or through collection and analysis of additional data sets, and then refined. 



cell growth and viability in response to 
changes in cellular energy levels 17 . 

Given the distinct regulation of RNA and 
protein levels, integration of data pertaining to 
RNA and protein products that are encoded 
by the same genes can tell us a lot about 
tumour function. Nishizuka et al. analysed 
gene-expression patterns of 60 human cancer 
cell lines (NCI-60) used by the National 
Cancer Institute to screen compounds for 
anticancer activity, and measured levels of 52 
cancer-related proteins in these cells 18 . 
Clustered image maps of protein levels uncov- 
ered two markers that could be used to distin- 
guish colon from ovarian adenocarcinomas. 
Integration of protein and mRNA data led to 
the interesting observation that the levels of 
structural proteins were highly correlated with 
the levels of their corresponding mRNAs in 
the NCI-60 cell lines, whereas the levels of 
non-structural proteins were poorly correlated 
with those of their corresponding mRNAs. 

Gene-expression and proteomic data sets 
from lung tumours have also been compared 
and integrated, along with serum samples 
from the same patients 19-21 . To determine 
whether gene-expression profiles could be 
used in prognosis, mRNA profiles in tumours 
from 86 newly diagnosed patients, including 
67 with early-stage and 19 with advanced- 
stage lung adenocarcinoma, were measured by 
oligonucleotide microarray analysis 19 . A gene- 
expression index, based on expression of the 
genes that correlated with survival of the 86 
patients, was able to identify low-risk and 
high-risk groups among the patients with 
stage-I lung adenocarcinomas. The index 



included many novel genes that were not pre- 
viously associated with survival in lung adeno- 
carcinoma. A large number of genes, such as 
the CRK oncogene, showed a graded pattern 
of expression among the tumours. A small 
number of genes, such as ERBB2, were only 
overexpressed in a small number of tumours, 
but were also correlated with poor outcome. 

In parallel, proteomic studies were under- 
taken to identify proteins associated with 
patient outcome 20 . A leave-one-out cross- 
validation procedure that analysed proteins 
associated with patient outcome — which 
were identified by Cox modelling — indicated 
that specific protein profiles can be used to pre- 
dict the likelihood of survival in patients with 
stage-I tumours. Integration of RNA and pro- 
tein data from the same tumours, and from an 
independent study, showed that 11 of 27 
mRNAs associated with survival were repre- 
sented in the profile of survival-associated pro- 
teins. Interestingly, combined analysis of 
protein and mRNA data revealed that 11 com- 
ponents of the glycolysis pathway were associ- 
ated with poor outcome, either at the protein 
or RNA levels. Phosphoglycerate kinase 1 
expression was associated with reduced patient 
survival time, based on both RNA and protein 
studies, and also based on immunohistochem- 
istry analysis using tissue microarrays in an 
independent validation set of 117 lung 
tumours. The relative abundance of this pro- 
tein in tumours led to the assessment of its lev- 
els in the sera of patients with lung cancer, 
revealing a correlation between increased 
serum levels of phosphoglycerate kinase 1 and 
poor outcome. 
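
Leave-one-out cross-validation, mentioned above, holds out each patient in turn, builds the predictor from the remaining patients, and checks the prediction for the held-out patient. The sketch below illustrates only that loop. The cited study identified proteins by Cox modelling; here a much simpler nearest-mean classifier on a single hypothetical protein level is substituted, and all values and labels are invented for the example.

```python
def nearest_mean_predict(train, value):
    """Predict the label whose training-set mean is closest to value."""
    groups = {}
    for x, label in train:
        groups.setdefault(label, []).append(x)
    means = {label: sum(xs) / len(xs) for label, xs in groups.items()}
    return min(means, key=lambda label: abs(means[label] - value))

def leave_one_out_accuracy(samples):
    """Fraction of samples predicted correctly when each is held out in turn."""
    correct = 0
    for i, (value, label) in enumerate(samples):
        train = samples[:i] + samples[i + 1:]  # everything except sample i
        if nearest_mean_predict(train, value) == label:
            correct += 1
    return correct / len(samples)

# Hypothetical protein levels paired with outcome labels.
samples = [
    (1.0, "good"), (1.2, "good"), (0.9, "good"),
    (2.8, "poor"), (3.1, "poor"), (3.0, "poor"),
]
accuracy = leave_one_out_accuracy(samples)
```

The held-out sample never contributes to the means used to classify it, which is what allows the resulting accuracy to serve as an estimate of performance on new patients.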
Challenges 

The studies presented above, although rela- 
tively simple from the point of view of extent 
of integration of heterogeneous data sets, illus- 
trate the merits of an integrated approach to 
tumour profiling. However, collecting and 
integrating sets of data that are quite diverse 
represents a substantial undertaking that 
necessitates resources not available to most 
investigators. Experimental data must be 
processed and stored in a manner that is com- 
patible with integration with other external, 
scattered data sources. Further complications 
stem from the substantial variation in the 
nomenclature used to identify the same object 
and to designate its attributes. For example, the 
protein encoded by a gene can be designated 
differently from the gene itself. Annotation 
with controlled vocabularies is required to 
achieve comparability across data sets. Even 
with adequate resources, the data generated are 
not always sufficiently reliable for a meaningful 
integrated analysis. For example, for genes that 
are expressed at very low levels, mRNA and 
protein levels can show a lack of correlation 
simply because of the limited sensitivity of 
the measurements. 
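
The nomenclature problem can be made concrete with a small sketch that maps differing names onto one controlled symbol before records from two sources are merged. The alias table below is a hypothetical stand-in for a curated controlled vocabulary and covers only two genes mentioned in this article (TP53, whose protein is commonly called p53, and ERBB2, also known as HER2); the record values are invented.

```python
# Hypothetical alias table: upper-cased name -> controlled gene symbol.
ALIASES = {
    "TP53": "TP53",
    "P53": "TP53",
    "ERBB2": "ERBB2",
    "HER2": "ERBB2",
}

def merge_by_symbol(records, aliases):
    """Group (name, value) records under controlled symbols.

    Returns the merged dict and a list of names that could not be mapped,
    so that unmapped entries are reported rather than silently dropped.
    """
    merged, unmapped = {}, []
    for name, value in records:
        symbol = aliases.get(name.strip().upper())
        if symbol is None:
            unmapped.append(name)
            continue
        merged.setdefault(symbol, []).append(value)
    return merged, unmapped

# An RNA data set that uses gene symbols and a protein data set that
# uses protein names.
records = [("TP53", 2.1), ("p53", 1.8), ("HER2", 4.0), ("XYZ1", 0.3)]
merged, unmapped = merge_by_symbol(records, ALIASES)
```

Returning the unmapped names separately keeps the gaps in the vocabulary visible, which matters when deciding whether an apparent lack of correlation reflects biology or merely a failed match.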

Another serious challenge to studying can- 
cer pathogenesis is the effectiveness of devel- 
oping models capable of accounting for all 
the data collected with different high- 
throughput approaches. Although researchers 
have attempted for many years to devise 
mathematical models for many aspects of 
cancer, such as for tumour growth 22 , tumour 
drug delivery 23 or gene-environment interac- 
tions 24 , it is challenging to develop models 
that integrate the numerous pathways and 
factors that operate at various levels during 
tumour growth. Development of a model 
that would be able to predict the conse- 
quences of a particular mutation for tumori- 
genesis is more difficult than predicting 
the consequences of a mutation for a simple 
system, such as for a cultured microorganism. 

Models of human cancer are also impaired 
by the substantial lack of homogeneity among 
study populations and, most importantly, by 
the inability to manipulate components of the 
system. Furthermore, numerous members of 
the 'parts list' that is required to construct any 
model cannot be measured or manipulated, 
or might not even have been identified. 
Initially, models might therefore represent 
approximations that generate hypotheses to 
be tested through further experimentation. 
Further experiments could then yield addi- 
tional data to allow a more robust model to be 
developed (FIG. 3). For example, the finding that 
upregulation of glycolytic enzymes correlates 
with poor outcome in patients with lung 



cancer led to the finding that increased activity 
of the transcription factor hypoxia-inducible 
factor-1α (HIF1α), which is known to regulate 
expression of glycolytic-pathway genes, was 
also correlated with poor survival in patients 
with lung cancer 25 . HIF1α has since been 
associated with numerous tumour types. 

Resources 

An expanding array of resources, in the form 
of databases and tools, is available to allow 
experimental global profiling data and other 
types of data to be integrated. Fortunately, a 
large amount of data has become available on 
gene expression in normal and cancer cells 
through initiatives such as the Cancer 
Genome Anatomy Project and the Director's 
Challenge initiative, funded by the National 
Cancer Institute (NCI). There are also 
numerous other relevant data repositories. So 
an investigator who finds that a specific gene 
is upregulated in a certain tumour type would 
be able to learn more about the expression 
pattern of this gene in other tumour types, as 
well as in normal tissues, through various 
gene-expression databases (BOX 1). 

There are now numerous resources avail- 
able for mining data from various global- 
profiling techniques. One of the first publicly 
available web databases of pathway informa- 
tion is the Kyoto Encyclopedia of Genes and 
Genomes (KEGG) 26 . Over 150 pathways are 
represented with emphasis on well-defined 
metabolic pathways. The KEGG pathway 
reference diagrams can be readily integra- 
ted with genomic and proteomic data. 
GenMAPP (Gene MicroArray Pathway 
Profiler) is a freely available program for 
viewing and analyzing expression data on 
'microarray pathway profiles' (MAPPs) repre- 
senting biological pathways or any other 
functional grouping of genes 27 . Over 50 
MAPP files depicting various biological path- 
ways and gene families are available. 
GenMAPP includes gene annotation infor- 
mation as described by the Gene Ontology 
(GO) Consortium 28 . The GenMAPP program 
identifies GO terms that seem to be over-rep- 
resented in a data set, providing clues to rele- 
vant biological processes. Transpath is an 
online web database on signal transduction 
and gene-regulatory pathways that lists over 
15,000 protein-protein interactions involving 
several thousand genes 29 . The Kinase Pathway 
Database 30 uses a natural language processing 
algorithm to automatically extract protein 
interaction information from the literature. 
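
Over-representation of an annotation term in a gene list, as mentioned above for GO terms, is often assessed with a hypergeometric test. The sketch below shows that calculation under stated assumptions: the statistic that GenMAPP itself uses is not described here, so the hypergeometric upper tail is substituted as a common choice, and the counts in the example are hypothetical.

```python
from math import comb

def hypergeom_upper_tail(population, annotated, drawn, hits):
    """Probability of at least `hits` annotated genes in a random draw.

    population: total number of genes measured.
    annotated:  number of those genes that carry the term.
    drawn:      size of the gene list being tested.
    hits:       number of genes in the list that carry the term.
    """
    total = comb(population, drawn)
    upper = min(annotated, drawn)
    favourable = sum(
        comb(annotated, k) * comb(population - annotated, drawn - k)
        for k in range(hits, upper + 1)
    )
    return favourable / total

# Hypothetical example: 20 genes measured, 5 carry the term, and all 5
# genes in the list of interest carry it.
p_value = hypergeom_upper_tail(population=20, annotated=5, drawn=5, hits=5)
```

When many terms are tested at once, the resulting p-values would need correction for multiple testing before any term is taken as a clue to the underlying biology.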

Other resources include public databases 
of protein-protein interactions, namely the 
Biomolecular Interaction Database (BIND) 31 
and the Database of Interacting Proteins 32 . 
However, the organism most represented in 
these databases is Saccharomyces cerevisiae, 
for which substantial protein-protein inter- 
action data have been generated. (For further 
information on the resources discussed 
above and in the following section, see the 
online links box.) 



Box 1 | Some of the resources available for 'data mining' in cancer research 

In addition to maintaining the GenBank nucleic-acid sequence database, the National Center 
for Biotechnology Information (NCBI) provides data analysis and retrieval resources for the 
data in GenBank and other biological data made available through the NCBI web site 35 . 
Relevant NCBI resources include the Cancer Chromosome Aberration Project, Entrez 
Genomes and related tools, the Map Viewer, Model Maker, Evidence Viewer, the Clusters of 
Orthologous Groups (COGs) database, SAGEmap, Gene Expression Omnibus (GEO) and the 
Molecular Modeling Database (MMDB). There are also available custom implementations of 
the BLAST program that are optimized to search specialized data sets. The National Cancer 
Institute, through its Center for Bioinformatics, provides informatics infrastructure support to 
advance translational cancer research. The centre provides open access to large and diverse data 
sets that result from NCI-funded initiatives. It also provides a resource that integrates such data 
with outside data and provides facilities for data management and distribution. The resource, 
designated CaCore, consists of a series of component technologies and services 36 . Enterprise 
Vocabulary Services provide controlled vocabulary, dictionary and thesaurus services. The 
Cancer Data Standards Repository provides a meta-data registry for common data elements. 
Cancer Bioinformatics Infrastructure Objects (caBIO) implements an object-oriented model of 
the biomedical domain and provides Java, Simple Object Access Protocol and HTTP-XML 
application programming interfaces. 

Other resources include GoMiner, developed by Zeeberg et al. 37 . GoMiner is a resource 
package that organizes lists of genes, such as under- and overexpressed genes from a range of 
microarray experiments 28 . GoMiner provides quantitative and statistical output files and 
visualization graph structures. Genes that are displayed in GoMiner are linked to the main 
public bioinformatics resources. (For further information on the resources discussed above, 
see the online links box.) 



NATURE REVIEWS | APPLIED PROTEOMICS COLLECTION 



MARCH 2005TT3 



PERSPECTIVES 



Future directions 

Clearly, additional resources are needed to 
facilitate integration of diverse data sets. The 
NCI plans to deploy an integrating biomedical 
informatics infrastructure called the Cancer 
Biomedical Informatics Grid (CaBIG), which 
will be developed in partnership with the can- 
cer-research community. Around 50 cancer 
centres have joined this NCI-led project. The 
goals of CaBIG are to integrate data from 
diverse sources and to support interoperable 
analytic tools. The open-source, open-access 
grid will allow different research groups to 
search the expanding collection of cancer 
research data together with locally generated 
data. A similar and related effort is also under- 
way in the United Kingdom, where the 
National Cancer Research Institute, which 
represents government, philanthropic and pri- 
vate-sector organizations that fund cancer 
research, has set up a unit to develop cancer 
research informatics. This will facilitate inte- 
gration of data generated by laboratories 
across different organizations. 

Apart from informatics considerations, 
tumour-profiling technologies would benefit 
from miniaturization of assays and increases 
in throughput and sensitivity, given the lim- 
ited availability of tumour tissues. For exam- 
ple, the availability of proteome-scale capture 
agents would facilitate the use of microarrays 
in proteomic profiling, in a manner similar 
to transcriptome profiling. The availability 
of technologies for global profiling using 
formalin-fixed tissue would also be beneficial. 

Understanding cancer as a complex dis- 
ease, through systems-biology or systems- 
pathology approaches, requires teams of 
investigators from diverse fields such as 
biomedicine, chemistry, engineering, infor- 
matics and computational modelling. Soon, 
data obtained from molecular imaging 
studies might also be integrated. The con- 
tinued development of sensitive molecular- 
imaging-based assays that do not require 
tissue samples will be valuable for monitor- 
ing molecular and cellular processes in both 
animal models of cancer and in humans 33 . 
Integration of molecular imaging with 
other molecular approaches to tissue analy- 
sis could add a spatial and a temporal per- 
spective to our understanding of tumour 
development and progression. 

The need for multidisciplinary research 
into cancer and other diseases has been recog- 
nized by the National Institutes of Health 
(NIH) with the implementation of the 'NIH 
roadmap' 34 . A systems-biology approach to 
cancer that incorporates different genome- 
scale global-profiling technologies is expected 
to lead to the development of computational 



models of gene regulation in cancer and 
important cancer-related cell processes, 
such as differentiation, proliferation, trans- 
formation and metastasis. This will lead to 
molecular-based classifications of cancer 
that transcend organ and tissue types — 
these should supersede classifications based 
on histopathology or based on the expres- 
sion patterns of genes with unknown func- 
tional significance. New and important 
features of tumorigenesis and tumour pro- 
gression will be uncovered in this manner, 
leading to more effective screening strategies 
and therapeutic targets. 
Profiling gene expression using DNA 
arrays has had a tremendous impact 
on biomedical research. From a disease 
investigation point of view, applica- 
tions of DNA microarrays include 
uncovering unsuspected associations 
between genes and specific clinical 
features of disease, resulting in novel, 
molecular-based disease classifica- 
tions. Cancer is a case in point. Most 
published studies of cancers using 
DNA microarrays have either exam- 
ined a pathologically homogeneous 
set of tumors to identify clinically 
relevant subtypes, for example, re- 
sponders vs nonresponders, or patho- 
logically distinct subtypes of cancer of 
the same lineage, for example, high- 
stage vs low-stage tumors to identify 
molecular correlates, or tumors of 
different lineages to identify molecular 
signatures for each lineage. A study of 
cutaneous T-cell lymphoma by Kari et 
al, 1 published recently, typifies both 
what one hopes to gain from disease 
investigations using DNA microarrays 
and the limitations of such studies. 

Primary cutaneous lymphomas are a 
heterogeneous group of lymphomas of 
T- or B-cell origin that represent a 
relatively common type of lymphoma 
and their incidence appears to be 
increasing. The two predominant sub- 
types of cutaneous T-cell lymphomas 
are mycosis fungoides, a mostly 
indolent variety, and its leukemic 
counterpart the Sezary syndrome, an 



aggressive variety characterized by 
skin involvement, lymphadenopathy 
and circulating atypical lymphocytes, 
the so-called Sezary cells. Kari et al 
used cDNA microarrays to study gene 
expression patterns in peripheral 
blood mononuclear cells from patients 
with the leukemic form of cutaneous 
T-cell lymphoma. The goal of the 
study was to identify markers that 
may be useful for diagnosis or prog- 
nosis, or that might provide new 
targets for treating this disease. The 
approach was to uncover gene expres- 
sion differences between cells from 18 
patients with high Sezary cell counts 
and an appropriate (Th2-skewed) cell 
fraction from nine normal controls. 
The differences in gene expression 
observed reflected many of the ob- 
served characteristics of the disease. 
Overexpressed genes in disease sam- 
ples included some genes required for 
Th2 differentiation characteristic of 
Sezary cells. The analysis, however, 
did not uncover changes consistent 
with the hypothesis of defective apoptotic
pathways in this disease. An
important objective of the study was 
to identify markers for cutaneous T- 
cell lymphoma given the paucity of 
such markers. A member of the plastin 
gene family and a chemokine 
(CX3CR1) inappropriately expressed 
represented such potential novel markers.
Two genes found to have a high
predictive power to classify patients 
and controls were STAT4 and GTPase 
RhoB. These two genes alone accu- 
rately classified the high Sezary cell 
patients and controls. A signature 
profile with 10 genes was uncovered 



that identified a class of patients who 
succumb to the disease early, irrespec- 
tive of their tumor burden. The study 
therefore uncovered a wealth of find- 
ings that shed some light on the 
biology of this disease and uncovered 
markers that may have a practical 
utility. 

The DNA microarray studies de- 
scribed above and others in the litera- 
ture indeed point to the great utility of 
DNA microarrays for uncovering pat- 
terns of gene expression that are 
clinically informative. Have the data 
been thoroughly analyzed? There is no 
shortage of analytical tools for unco- 
vering patterns in microarray data. An 
important challenge for microarray 
analysis is to understand at a mechan- 
istic level the significance of associa- 
tions observed between subsets of 
genes and clinical features of disease. 
Another challenge is to identify the 
smallest but most informative sets of 
genes associated with specific clinical 
features, which then could be inter- 
rogated using technologies available in 
clinical laboratories, as appears to have 
been accomplished in this study. An- 
other challenge is to determine how 
well RNA levels of predictive genes 
correlate with protein levels. A lack of 
correlation may imply that the pre- 
dictive property of the gene(s) is 
independent of gene function. 

To increase the effectiveness of DNA 
microarray analysis, global gene ex- 
pression data may be combined with 
external data sources, such as gene 
annotation, in order to associate the 
expression patterns of a set of genes 
with the biological processes that they 
may represent. A welcome trend of 
data sharing allows others to analyze 
previously published microarray data 
and to combine multiple data sets. For 
illustration, we examined the data set 
published by Kari et al to see what we 
could uncover. In our analysis, we 
relied on the Gene Ontology (GO) 
annotation. The Gene Ontology Con- 
sortium 2 has defined a controlled 
vocabulary for describing genes in 
terms of their molecular function,
participation in biological processes
and cellular locations. The GO anno- 
tations are making possible the high- 
throughput analyses of gene expres- 
sion in terms of functional gene class 
associations, which otherwise would 
require laborious and somewhat sub- 
jective manual literature searches. 

Using the data set from Kari et al, we 
searched a set of 122 genes found 
overexpressed in patients with high 
blood tumor burden, or Sezary cell 
count, compared to healthy controls 
(P<0.01, fold change >1.5), for sig- 
nificantly enriched (over-represented) 
GO terms, as described elsewhere. 3 We 
made the same search for a set of 280 
genes found underexpressed in pa- 
tients with high Sezary cell count 
(P<0.01, fold change <0.67). Our 
premise is that annotation terms that 
are shared by a significant number of 
genes within a large gene set may 
provide clues as to the processes driv- 
ing the coordinate expression of the 
genes as a whole. Numerous enriched 



terms were found for the set of 280 
underexpressed genes with P< 0.001, 
including class II major histocompatibil- 
ity complex antigen (five genes repre- 
sented), cytokine-binding activity (six), 
mitochondrion (26), electron transporter 
activity (12) and nucleotide metabolism 
(four); these enriched terms could 
suggest a downregulation in CTCL of 
processes related to the immune re- 
sponse and mitochondrial function. 
Terms found enriched for the set of 
122 overexpressed genes with P<0.05 
include cell adhesion (nine genes 
represented) and cell cycle arrest 
(three). 
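
The enrichment search described above can be illustrated with a one-sided hypergeometric test, which is a common way to ask whether an annotation term is over-represented in a gene list. This is only a sketch under stated assumptions: the source defers its actual procedure to reference 3, so the test shown here is not necessarily the one used, and the function name, the size of the gene universe and the number of annotated genes in the usage line are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(n_universe, n_annotated, n_selected, n_hits):
    """One-sided hypergeometric P value for term over-representation.

    Returns P(X >= n_hits), where X counts genes carrying the term among
    n_selected genes drawn without replacement from n_universe genes,
    n_annotated of which carry the term. All names are illustrative.
    """
    max_hits = min(n_annotated, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_annotated, k) * comb(n_universe - n_annotated, n_selected - k)
        for k in range(n_hits, max_hits + 1)
    )
    return tail / total


if __name__ == "__main__":
    # 122 overexpressed genes, nine of them annotated "cell adhesion" (from the
    # text); the universe of 10000 genes and the 200 annotated genes are
    # hypothetical placeholders.
    print(hypergeom_enrichment_p(10000, 200, 122, 9))
```

A term would be reported as enriched when this P value falls below the chosen threshold (P<0.001 and P<0.05 for the two gene sets above).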

The enriched GO terms listed above 
represent only a fraction of the genes 
significantly expressed in CTCL, and 
additional gene-to-process associa- 
tions, not currently described in the 
biomedical literature or public annota- 
tion sources, may be inferred from 
data mining of large expression profile 
data sets. Our premise in this case 
is that genes that are coordinately 



expressed participate in closely related 
biological processes. 4 For a given gene, 
a GO term may be associated if the 
gene is correlated in expression with a 
significant number of other genes that 
share the given GO term annotation. 
We examined the expression patterns 
of 60 genes highly underexpressed in 
the Kari et al data set for patients with 
high Sezary cell count (P<0.01, fold 
change <0.33) that were also repre- 
sented in a large independent data set 
of leukemia expression profiles from 
Armstrong et al. 5 For each of the 60
genes, the set of genes with significant 
positive correlations (P<0.01) with 
the given gene in the Armstrong data 
set was searched for significantly en- 
riched GO terms (P<0.0001). In this 
way, 1963 gene-to-term associations, 
involving all 60 genes, were found. We 
performed two simulation tests to 
assess the number of random gene-to- 
term associations that could exist in 
the Armstrong data set, in one test 
permuting the expression values and

Figure 1 Hierarchical clustering of associations of GO terms for genes found underexpressed in patients with high Sezary cell count
(P<0.01, fold change <0.33). For each gene-to-term association represented here, the given gene was found positively correlated in
expression with a significant number of other genes that share the given GO term annotation. The rows in the matrix diagram represent
genes; the columns represent terms. An entry in the matrix indicates that the corresponding gene-to-term association was found in the 
leukemia profile data set from Armstrong et al with P<0.0001. Three major clusters are highlighted corresponding to terms related to (1) 
intercellular signaling, (2) the immune response, and (3) cell proliferation. Table 1 lists the genes that fall under each cluster.

Table 1 GO term associations from Figure 1 for genes underexpressed in patients with high Sezary cell counts

Gene *      Gene product description

Cluster 1: integral to plasma membrane; receptor activity; signal transducer activity; cell surface receptor-linked signal transduction; cell motility;
G-protein-coupled receptor protein signaling pathway; cell-cell signaling; development; organogenesis; morphogenesis; extracellular

CCL2        Small inducible cytokine A2
CD8B1       CD8 antigen, beta polypeptide 1 (p37)
CTSL        Cathepsin L
GPNMB       Glycoprotein (transmembrane) nmb
IL1R1       Interleukin 1 receptor, type 1
ITGB4       Integrin, beta 4
MAL         Mal, T-cell differentiation protein
MAOA        Monoamine oxidase A
ME1         Malic enzyme 1, NADP(+)-dependent, cytosolic
PLAU        Plasminogen activator, urokinase
STAT4       Signal transducer and activator of transcription 4
TNFAIP6     Tumor necrosis factor, alpha-induced protein 6

Cluster 2: immune response; response to biotic stimulus; defense response; vacuole; lytic vacuole; lysosome

CCL4        Small inducible cytokine A4 (homologous to mouse Mip-1b)
CYP1B1      Cytochrome P450, subfamily I (dioxin-inducible), polypeptide 1 (glaucoma 3, primary infantile)
FCER2       Fc fragment of IgE, low-affinity II, receptor for (CD23A)
GZMK        Granzyme K (serine protease, granzyme 3; tryptase II)
IL4R        Interleukin 4 receptor
MMP9        Matrix metalloproteinase 9 (gelatinase B, 92 kDa gelatinase, 92 kDa type IV collagenase)
TIMP1       Tissue inhibitor of metalloproteinase 1 (erythroid potentiating activity, collagenase inhibitor)

Cluster 3: DNA repair; DNA replication; nucleolus; cell cycle; cell proliferation; mitosis; mRNA processing; mRNA splicing; ubiquitin-dependent
protein catabolism; 26S proteasome; spliceosome complex; translation initiation factor activity; mitochondrion; oxidative phosphorylation;
tricarboxylic acid cycle; cytochrome c oxidase activity

AKAP9       A kinase (PRKA) anchor protein (yotiao) 9
AP1B1       Adaptor-related protein complex 1, beta 1 subunit
ATOX1       ATX1 (antioxidant protein 1, yeast) homolog 1
ATP5G3      ATP synthase, H+ transporting, mitochondrial F0 complex, subunit c (subunit 9) isoform 3
CD164       CD164 antigen, sialomucin
CDC2        Cell division cycle 2, G1-S and G2-M
DRG1        Developmentally regulated GTP-binding protein 1
HADH2       Hydroxyacyl-coenzyme A dehydrogenase, type II
HLA-DQB1    Major histocompatibility complex, class II, DQ beta 1
LDHA        Lactate dehydrogenase A
LMNB2       Lamin B2
NDUFS1      NADH dehydrogenase (ubiquinone) Fe-S protein 1 (75 kDa) (NADH-coenzyme Q reductase)
OXCT        3-oxoacid CoA transferase
PCNA        Proliferating cell nuclear antigen
RUNX1       Runt-related transcription factor 1 (acute myeloid leukemia 1; aml1 oncogene)
SATB1       Special AT-rich sequence-binding protein 1 (binds to nuclear matrix/scaffold-associating DNA's)
SLC25A11    Solute carrier family 25 (mitochondrial carrier; oxoglutarate carrier), member 11
SPINT2      Serine protease inhibitor, Kunitz type, 2
TOP2A       Topoisomerase (DNA) II alpha (170 kDa)
TXNRD1      Thioredoxin reductase 1
UBE2C       Ubiquitin carrier protein E2-C
VDAC1       Voltage-dependent anion channel 1
YWHAZ       Tyrosine 3-monooxygenase/tryptophan 5-monooxygenase activation protein, zeta polypeptide
ZNF204      Zinc-finger protein 204



in another test permuting the annota- 
tion assignments. Neither search with 
these randomized data sets yielded 
more than 15 associations, indicating 
that most of the actual associations 
found were not the result of chance. 



We used average linkage hierarchical 
clustering 4 to obtain a global view of 
the gene-to-term associations mined 
from the Armstrong leukemia expres- 
sion data set. Figure 1 shows the 
resulting cluster diagram (with 113 



GO terms that were associated 
with at least five genes being repre- 
sented and with 49 genes that were 
associated with at least one of these 
terms). Genes are represented in 
the rows of the matrix diagram, and
terms are represented in the columns.
An entry in the diagram indicates that 
the given gene (underexpressed in 
patients with high Sezary count compared
to healthy controls) was coexpressed
with a significant number of
genes that share the given annotation 
term. GO terms that are closely related 
to each other by the biological con- 
cepts that they represent were found 
to cluster together. The clustering 
diagram defines three distinct major 
clusters of genes and terms related to 
intercellular signaling (labeled as Cluster
'1' in the figure), the immune
response (labeled Cluster '2'), and cell
proliferation (labeled Cluster '3').
Table 1 lists the genes that fall under 
each cluster, with example terms. 

Our GO term clustering analysis 
indicates that many of the genes 
underexpressed in CTCL may be asso- 
ciated with processes of cell prolifera- 
tion, the immune response or 
intercellular signaling, which suggests 
a hypothesis that the pathogenesis of 
CTCL involves a downregulation of 
these processes. CTCL is characterized 
by the accumulation of malignant 
cells with a low proliferative index, 
which appears consistent with the 
observation made here of numerous 
genes associated with proliferation 
being underexpressed in CTCL. The 
observed underexpression in CTCL of 
numerous genes involved in the im- 
mune response, including several 
genes encoding for the class II major 
histocompatibility antigen complex, 
might be construed as contradicting 



one hypothesis that CTCL may be a 
malignancy of T cells stimulated to 
proliferate against its own tumor anti- 
gens. 6 There has been much specula- 
tion that CTCL cells are defective in 
their apoptotic pathways, and that the 
disease is linked to an accumulation 
rather than a true proliferation of T 
cells. 7 Underexpressed genes in CTCL 
thought to mediate apoptosis, includ- 
ing STAT4, CTSL (cathepsin L), IL1R1 
(interleukin 1 receptor, type I) and 
TNFAIP6 (tumor necrosis factor, alpha- 
induced protein 6), are associated here 
with intercellular signaling-related 
terms. 

However perfected DNA microarrays 
and their analytical tools become for 
disease profiling, they will not elim- 
inate a pressing need for other types of 
profiling technologies that go beyond 
measuring RNA levels, particularly for 
disease-related investigations. DNA 
microarrays have limited utility for 
the analysis of biological fluids and 
for uncovering, directly in the fluid,
assayable biomarkers. There is a need
to assay protein levels and activity. 
Numerous alterations may occur in 
proteins that are not reflected in 
changes at the RNA level, providing a 
compelling rationale for additional, 
direct analysis of gene expression at 
the protein level. The next challenge is 
to integrate RNA data with protein 
data.

We have determined the relationship between mRNA and protein expression levels for selected genes
expressed in the yeast Saccharomyces cerevisiae growing at mid-log phase. The proteins contained in total yeast
cell lysate were separated by high-resolution two-dimensional (2D) gel electrophoresis. Over 150 protein spots
were excised and identified by capillary liquid chromatography-tandem mass spectrometry (LC-MS/MS).
Protein spots were quantified by metabolic labeling and scintillation counting. Corresponding mRNA levels
were calculated from serial analysis of gene expression (SAGE) frequency tables (V. E. Velculescu, L. Zhang,
W. Zhou, J. Vogelstein, M. A. Basrai, D. E. Bassett, Jr., P. Hieter, B. Vogelstein, and K. W. Kinzler, Cell
88:243-251, 1997). We found that the correlation between mRNA and protein levels was insufficient to predict
protein expression levels from quantitative mRNA data. Indeed, for some genes, while the mRNA levels were
of the same value the protein levels varied by more than 20-fold. Conversely, invariant steady-state levels of
certain proteins were observed with respective mRNA transcript levels that varied by as much as 30-fold.
Another interesting observation is that codon bias is not a predictor of either protein or mRNA levels. Our
results clearly delineate the technical boundaries of current approaches for quantitative analysis of protein
expression and reveal that simple deduction from mRNA transcript analysis is insufficient.



The description of the state of a biological system by the 
quantitative measurement of the system constituents is an es- 
sential but largely unexplored area of biology. With recent 
technical advances including the development of differential 
display-PCR (21), of cDNA microarray and DNA chip tech- 
nology (20, 27), and of serial analysis of gene expression 
(SAGE) (34, 35), it is now feasible to establish global and 
quantitative mRNA expression profiles of cells and tissues in 
species for which the sequence of all the genes is known. 
However, there is emerging evidence which suggests that 
mRNA expression patterns are necessary but are by them- 
selves insufficient for the quantitative description of biological 
systems. This evidence includes discoveries of posttranscrip- 
tional mechanisms controlling the protein translation rate (15), 
the half-lives of specific proteins or mRNAs (33), and the
intracellular location and molecular association of the protein 
products of expressed genes (32). 

Proteome analysis, defined as the analysis of the protein 
complement expressed by a genome (26), has been suggested 
as an approach to the quantitative description of the state of a 
biological system by the quantitative analysis of protein expres- 
sion profiles (36). Proteome analysis is conceptually attractive 
because of its potential to determine properties of biological 
systems that are not apparent by DNA or mRNA sequence 
analysis alone. Such properties include the quantity of protein 
expression, the subcellular location, the state of modification, 
and the association with ligands, as well as the rate of change 
with time of such properties. In contrast to the genomes of a 
number of microorganisms (for a review, see reference 11) and 
the transcriptome of Saccharomyces cerevisiae (35), which have 
been entirely determined, no proteome map has been com- 
pleted to date. 

The most common implementation of proteome analysis is 
the combination of two-dimensional gel electrophoresis (2DE)
(isoelectric focusing-sodium dodecyl sulfate [SDS]-polyacrylamide
gel electrophoresis) for the separation and quantitation
of proteins with analytical methods for their identification. 
2DE permits the separation, visualization, and quantitation of 
thousands of proteins reproducibly on a single gel (18, 24). By 
itself, 2DE is strictly a descriptive technique. The combination 
of 2DE with protein analytical techniques has added the pos- 
sibility of establishing the identities of separated proteins (1, 2) 
and thus, in combination with quantitative mRNA analysis, of 
correlating quantitative protein and mRNA expression mea- 
surements of selected genes. 

The recent introduction of mass spectrometry protein anal- 
ysis techniques has dramatically enhanced the throughput and 
sensitivity of protein identification to a level which now permits 
the large-scale analysis of proteins separated by 2DE. The 
techniques have reached a level of sensitivity that permits the 
identification of essentially any protein that is detectable in the 
gels by conventional protein staining (9, 29). Current protein 
analytical technology is based on the mass spectrometry gen- 
eration of peptide fragment patterns that are idiotypic for the 
sequence of a protein. Protein identity is established by corre- 
lating such fragment patterns with sequence databases (10, 22, 
37). Sophisticated computer software (8) has automated the 
entire process such that proteins are routinely identified with 
no human interpretation of peptide fragment patterns. 

In this study, we have analyzed the mRNA and protein levels 
of a group of genes expressed in exponentially growing cells of 
the yeast S. cerevisiae. Protein expression levels were quantified 
by metabolic labeling of the yeast proteins to a steady state, 
followed by 2DE and liquid scintillation counting of the se- 
lected, separated protein species. Separated proteins were 
identified by in-gel tryptic digestion of spots with subsequent 
analysis by microspray liquid chromatography-tandem mass 
spectrometry (LC-MS/MS) and sequence database searching. 
The corresponding mRNA transcript levels were calculated 
from SAGE frequency tables (35). 

This study, for the first time, explores a quantitative com-
parison of mRNA transcript and protein expression levels for
a relatively large number of genes expressed in the same met-
abolic state. The resultant correlation is insufficient for prediction
of protein levels from mRNA transcript levels. We have
also compared the relative amounts of protein and mRNA
with the respective codon bias values for the corresponding
genes. This comparison indicates that codon bias by itself is
insufficient to accurately predict either the mRNA or the pro-
tein expression levels of a gene. In addition, the results dem-
onstrate that only highly expressed proteins are detectable by
2DE separation of total cell lysates and that therefore the
construction of complete proteome maps with current technol-
ogy will be very challenging, irrespective of the type of organ-
ism.

FIG. 1. Schematic illustration of proteome analysis by 2DE and mass spectrometry. In part I, proteins are separated by 2DE, stained spots are excised and subjected
to in-gel digestion with trypsin, and the resulting peptides are separated by on-line capillary high-performance liquid chromatography. In part II, a peptide is shown
eluting from the column in part I. The peptide is ionized by electrospray ionization and enters the mass spectrometer. The mass of the ionized peptide is detected, and
the first quadrupole mass filter allows only the specific mass-to-charge ratio of the selected peptide ion to pass into the collision cell. In the collision cell, the energized,
ionized peptides collide with neutral argon gas molecules. Fragmentation of the peptide is essentially random but occurs mainly at the peptide bonds, resulting in smaller
peptides of differing lengths (masses). These peptide fragments are detected as a tandem mass (MS/MS) spectrum in the third quadrupole mass filter where two ion
series are recorded simultaneously, one each from sequencing inward from the N and C termini of the peptide, respectively. In part III, the MS/MS spectrum from the
selected, ionized peptide is compared to predicted tandem mass spectra computer generated from a sequence database. Provided that the peptide sequence exists in
the database, the peptide and, by association, the protein from which the peptide was derived can be identified. Unambiguous protein identification is attained in a single
analysis because multiple peptides are identified as being derived from the same protein.

MATERIALS AND METHODS 

Yeast strain and growth conditions. The source of protein and message transcripts
for all experiments was YPH499 (MATa ura3-52 lys2-801 ade2-101
leu2-Δ1 his3-Δ200 trp1-Δ63) (30). Logarithmically growing cells were obtained by
growing yeast cells to early log phase (3 X 10* cells/ml) in YPD rich medium
(YPD supplemented with 6 mM uracil, 4.8 mM adenine, and 24 mM tryptophan)
at 30°C (33). Metabolic labeling of protein was accomplished in YPD medium
exactly as described elsewhere (4) with the exception that 1 ml of cells was
labeled with 3 mCi to offset methionine present in YPD medium. Protein was
harvested as described by Garrels and coworkers (12). Harvested protein was
lyophilized, resuspended in isoelectric focusing gel rehydration solution, and
stored at -80°C.

2DE. Soluble proteins were run in the first dimension by using a commercial
flatbed electrophoresis system (Multiphor II; Pharmacia Biotech). Immobilized
polyacrylamide gel (IPG) dry strips with nonlinear pH 3.0 to 10.0 gradients
(Amersham-Pharmacia Biotech) were used for the first-dimension separation.
Forty micrograms of protein from whole-cell lysates was mixed with IPG strip
rehydration buffer (8 M urea, 2% Nonidet P-40, 10 mM dithiothreitol), and 250
to 380 µl of solution was added to individual lanes of an IPG strip rehydration
tray (Amersham-Pharmacia Biotech). The strips were allowed to rehydrate at
room temperature for 1 h. The samples were run at 300 V-10 mA-5 W for 2 h,
then ramped to 3,500 V-10 mA-5 W over a period of 3 h, and then kept at 3,500
V-10 mA-5 W for 15 to 19 h. At the end of the first-dimension run (60 to 70 kV·h),
the IPG strips were reequilibrated for 8 min in 2% (wt/vol) dithiothreitol in
2% (wt/vol) SDS-6 M urea-30% (wt/vol) glycerol-0.05 M Tris HCl (pH 6.8) and
for 4 min in 2.5% iodoacetamide in 2% (wt/vol) SDS-6 M urea-30% (wt/vol)
glycerol-0.05 M Tris HCl (pH 6.8). Following reequilibration, the strips were
transferred and apposed to 10% polyacrylamide second-dimension gels. Polyacrylamide
gels were poured in a casting stand with 10% acrylamide-2.67%
piperazine diacrylamide-0.375 M Tris base-HCl (pH 8.8)-0.1% (wt/vol) SDS-0.05%
(wt/vol) ammonium persulfate-0.05% TEMED (N,N,N',N'-tetramethylethylenediamine)
in Milli-Q water. The apparatus used to run second-dimension gels
was a noncommercial apparatus from Oxford Glycosciences, Inc. Once the IPG
strips were apposed to the second-dimension gels, they were immediately run at
50 mA (constant)-500 V-85 W for 20 min, followed by 200 mA (constant)-500
V-85 W until the buffer front line was 10 to 15 mm from the bottom of the gel.
Gels were removed and silver stained according to the procedure of Shevchenko
et al. (29).

FIG. 2. 2D silver-stained gel of the proteins in yeast total cell lysate. Proteins were separated in the first dimension (horizontal) by isoelectric focusing and then in
the second dimension (vertical) by molecular weight sieving. Protein spots (156) were chosen to include the entire range of molecular weights, isoelectric focusing points,
and staining intensities. Spots were excised, and the corresponding protein was identified by mass spectrometry and database searching. The spots are labeled on the
gel and correspond to the data presented in Table 1. Molecular weights are given in thousands.

Protein identification. Gels were exposed to X-ray film overnight, and then the
silver staining and film were used to excise 156 spots of varying intensities,
molecular weights, and isoelectric focusing points. In order to increase the
detection limit by mass spectrometry, spots were cut out and pooled from up to
four identical cold, silver-stained gels. In-gel tryptic digests of pooled spots were
performed as described previously (29). Tryptic peptides were analyzed by microcapillary
LC-MS with automated switching to MS/MS mode for peptide
fragmentation. Spectra were searched against the composite OWL protein sequence
database (version 30.2; 250,514 protein sequences) (24a) by using the
computer program Sequest (8), which matches theoretical and acquired tandem
mass spectra. A protein match was determined by comparing the number of
peptides identified and their respective cross-correlation scores. All protein
identifications were verified by comparison with theoretical molecular weights
and isoelectric points.



mRNA quantitation. Velculescu and coworkers have previously generated
frequency tables for yeast mRNA transcripts from the same strain grown under
the same stated conditions as described herein (35). The SAGE technology is
based on two main principles. First, a short sequence tag (15 bp) that contains
sufficient information uniquely to identify a transcript is generated. A single tag
is usually generated from each mRNA transcript in the cell which corresponds to
15 bp at the 3'-most cutting site for NlaIII. Second, many transcript tags can be
concatenated into a single molecule and then sequenced, revealing the identity of
multiple tags simultaneously. Over 20,000 transcripts were sequenced from yeast
strain YPH499 growing at mid-log phase on glucose. Assuming the previously
derived estimate of 15,000 mRNA molecules per cell (16), this would represent
a 1.3-fold coverage even for mRNA molecules present at a single copy per cell
and would provide a 72% probability of detecting such transcripts. Computer
software which took for input the gene detected, examined the nucleotide se-
quence, and performed the calculation as described by Velculescu and coworkers
(35) was written. In practice, we found that for 21 of 128 (16%) genes examined
viable mRNA levels from SAGE data could not be calculated. This was because
(i) no CATG site was found in the open reading frame (ORF), (ii) a CATG site
was found but the corresponding 10-bp putative SAGE tag was not found in the
frequency tables, or (iii) identical putative SAGE tags were present for multiple
genes (e.g., TDH2_YEAST and TDH3_YEAST).
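
The coverage argument in the paragraph above can be checked with a short calculation. The sketch below assumes that tag sampling follows a Poisson model, which is a modeling choice made here for illustration and is not stated in the source; the function name and parameter names are likewise hypothetical.

```python
import math


def sage_detection_probability(tags_sequenced, mrna_per_cell, copies_per_cell=1):
    """Return (fold coverage, probability of seeing a transcript at least once).

    Assumes (illustratively) that the number of tags observed for a transcript
    is Poisson distributed with mean coverage * copies_per_cell.
    """
    coverage = tags_sequenced / mrna_per_cell
    expected_tags = coverage * copies_per_cell
    p_detect = 1.0 - math.exp(-expected_tags)
    return coverage, p_detect


if __name__ == "__main__":
    # 20,000 tags and 15,000 mRNA molecules per cell are the figures quoted in the text.
    coverage, p = sage_detection_probability(20_000, 15_000)
    print(f"coverage = {coverage:.2f}-fold, P(detect single-copy mRNA) = {p:.2f}")
```

Under this assumption the coverage is about 1.3-fold and the detection probability for a single-copy transcript comes out near 0.73 to 0.74, close to the 72% quoted above; the small difference presumably reflects rounding or a slightly different model, which the source does not specify.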
TABLE 1. Expressed genes identified from 2D gel in Fig. 2

Column headings (table body not recovered): Mol wt; pI; Spot no.; YPD gene name a; abundance (HP copies/

a YPD gene names are available from the YPD website (39).
b NA, calculation could not be performed or was not available.
c mRNA data inconclusive or NA.
d No methionines in predicted ORF; therefore, protein concentration was not determined.
e Measured molecular weight or pI did not match theoretical molecular weight or pI.



Protein quantitation. [35S]methionine-labeled gels were exposed to X-ray film
overnight, and then the silver stain and film were used to excise 156 spots of
varying intensities, molecular weights, and pIs. The excised spots were placed in
0.6-ml microcentrifuge tubes, and scintillation cocktail (100 µl) was added. The
samples were vortexed and counted. In addition, two parallel gels were electroblotted
to polyvinylidene difluoride membranes. The membranes were exposed
to X-ray film, and four intense single spots were excised from each membrane
and subjected to amino acid analysis. For these four spots, a mean of 209 ± 4
cpm/pmol of protein/methionine was found. This number was used to quantitate
all remaining spots in conjunction with the number of methionines present in the
protein.
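
The conversion from scintillation counts to protein amount described above can be sketched as follows. This is a minimal illustration, assuming the calibration is applied as counts divided by (209 cpm per pmol per methionine times the number of methionines); the function and constant names are hypothetical and the example spot is not taken from the study.

```python
# Calibration quoted in the text: mean of 209 cpm per pmol of protein per methionine.
CPM_PER_PMOL_PER_MET = 209.0


def protein_pmol(spot_cpm, n_methionines, cpm_per_pmol_per_met=CPM_PER_PMOL_PER_MET):
    """Estimate pmol of protein in an excised spot from its radioactivity.

    A protein without methionine cannot be quantified this way, so a
    ValueError is raised for n_methionines < 1.
    """
    if n_methionines < 1:
        raise ValueError("protein must contain at least one methionine")
    return spot_cpm / (cpm_per_pmol_per_met * n_methionines)


if __name__ == "__main__":
    # Hypothetical spot: 2,090 cpm from a protein with 5 methionines.
    print(protein_pmol(2090.0, 5))
```

The guard against proteins with no methionine mirrors the table footnote stating that protein concentration was not determined when the predicted ORF contains no methionines.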

To ensure that proteins were labeled to equilibrium, parallel 2D gels were 
prepared and run on yeast metabolically labeled for 1, 2, 6, or 18 h. The
corresponding 156 spots were excised from each gel, and radioactivity was mea- 
sured by liquid scintillation counting for each spot. Calculated protein levels were 
highly reproducible for all time points measured after 1 h. 

Calculation of codon bias and predicted half-life. Codon bias values were 
extracted from the YPD spreadsheet (17). Protein half-lives were calculated
based on the N-end rule (33). When the N-terminal processing was not known
experimentally, it was predicted based on the affinity of methionine aminopeptidase
(31).

RESULTS 

Characteristics of proteome approach. Nearly every facet of 
proteome analysis hinges on the unambiguous identification of 
large numbers of expressed proteins in cells. Several tech- 
niques have been described previously for the identification of 
proteins separated by 2DE, including N-terminal and internal 
sequencing (1, 2), amino acid analysis (38), and more recently 
mass spectrometry (25). We utilized techniques based on mass 
spectrometry because they afford the highest levels of sensitiv- 
ity and provide unambiguous identification. The specific pro- 
cedure used is schematically illustrated in Fig. 1 and is based 
on three principles. First, proteins are removed from the gel by
proteolytic in-gel digestion, and the resulting peptides are separated
by on-line capillary high-performance liquid chromatography.
Second, the eluting peptides are ionized and detected, and
the specific peptide ions are selected and fragmented by the 
mass spectrometer. To achieve this, the mass spectrometer 
switches between the MS mode (for peptide mass identifica- 
tion) and the MS/MS mode (for peptide characterization and 
sequencing). Selected peptides are fragmented by a process 
called collision-induced dissociation (CID) to generate a tandem
mass spectrum (MS/MS spectrum) that contains the peptide
sequence information. Third, individual CID mass spectra
are then compared by computer algorithms to predicted spec- 
tra from a sequence database. This results in the identification 
of the peptide and, by association, the protein(s) in the spot. 
Unambiguous protein identification is attained in a single anal- 
ysis by the detection of multiple peptides derived from the 
same protein. 
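
The comparison of a CID spectrum with predicted spectra relies on computing the fragment masses expected for each candidate peptide. The sketch below illustrates only that prediction step for singly charged y ions. It is not the search software used in the study, which is not described in this section. The residue masses are standard monoisotopic values supplied here as an assumption, and the function name is hypothetical.

```python
# Standard monoisotopic residue masses in Da (assumed reference values, not from the text).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # mass of H2O added to a C-terminal fragment
PROTON = 1.007276   # charge carrier for a singly protonated ion


def y_ion_mz(peptide):
    """Return predicted singly charged y-ion m/z values, y1 first."""
    values = []
    running = WATER + PROTON
    # y ions grow from the C terminus, so walk the sequence backwards.
    for residue in reversed(peptide):
        running += RESIDUE_MASS[residue]
        values.append(round(running, 3))
    return values


# Tryptic peptide named in the text (Fig. 3A).
for index, mz in enumerate(y_ion_mz("NSGDIVNLGSIAGR"), start=1):
    print(f"y{index}: {mz}")
```

A matching routine would then score how many predicted values fall within a mass tolerance of observed peaks; that scoring step is omitted here.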

Protein identification. Yeast total cell protein lysate (40 µg),
metabolically labeled with [35S]methionine, was electrophoretically
separated by isoelectric focusing in the first dimension
and by SDS-10% polyacrylamide gel electrophoresis in
the second dimension. Proteins were visualized by silver stain- 
ing and by autoradiography. Of the more than 1,000 proteins 
visible by silver staining, 156 spots were excised from the gel 
and subjected to in-gel tryptic digestion, and the resulting 
peptides were analyzed and identified by microspray LC- 
MS/MS techniques as described above. The proteins in this 
study were all identified automatically by computer software 
with no human interpretation of mass spectra. They are indi- 
cated in Fig. 2 and detailed in Table 1. 

The CID spectra shown in Fig. 3 indicate that the quality of 
the identification data generated was suitable for unambiguous 
protein identification. The spectra represent the amino acid 
sequences of tryptic peptides NSGDIVNLGSIAGR (Fig. 3A) 
and FAVGAFTDSLR (Fig. 3B). Both peptides were derived 
from protein S57593 (hypothetical protein YMR226C), which 
migrated to spot 114 (molecular weight, 29,156; pI, 6.59) in the
2D gel in Fig. 2. Five other peptides from the same analysis 
were also computer matched to the same protein sequence. 

Protein and mRNA quantitation. For the 156 genes investi- 
gated, the protein expression levels ranged from 2,200 (PGM2) 
to 863,000 (TDH2/TDH3) copies/cell. The levels of mRNA for 
each of the genes identified were calculated from SAGE fre- 
quency tables (35). These tables contain the mRNA levels for 
4,665 genes in yeast strain YPH499 grown to mid-log phase in 
YPD medium on glucose as a carbon source. In some in- 
stances, the mRNA levels could not be calculated for reasons 
stated in Materials and Methods. For the proteins analyzed in 
this study, mean transcript levels varied from 0.7 to 473 copies/ 
cell. 

Selection of the sample population for mRNA-protein ex- 
pression level correlation. The protein spots selected for iden- 
tification were selected from spots visible by silver staining in 
the 2D gel. An attempt was made not to include spots where 
overlap with other spots was readily apparent. The number of 
proteins identified was 156 (Table 1). Some proteins migrated 
to more than one spot (presumably due to differential protein 
processing or modifications), and protein levels from these 
spots were calculated by integrating the intensities of the dif- 
ferent spots. The 156 protein spots analyzed represented the 
products of 128 different genes. Genes were excluded from the 
correlation analysis only if part of the data set was missing; i.e., 
genes were excluded if (i) no mRNA expression data were 
available for the protein or putative SAGE tags were ambig- 
uous, (ii) the amino acid sequence did not contain methionine, 
(iii) more than a single protein was conclusively identified as 



Vol 19, 1999 

A 



CORRELATION BETWEEN PROTEIN AND mRNA LEVELS IN YEAST 1725 



1001 



75- 



(0 
T3 
C 
3 



5 50 



S 



25- 



> 201.7 
G 



86.3 



687,4 



560.2 



487.4 



373.2 



44 



NSGDIVNLGSIAGR 



787.7 



887.0 



< > Q 

<-> 



1000.4 




200 400 600 800 1000 1200 

m/z 



B 



100 



75- 



c 

I 

c 

3 



§ SO 



? 
I 

© 

DC 



25- 



218.5 



120.3 



FAVGAFTDSLR 



867.1 

! 



592.0 



G 4828 

v . 

<- > <4--~i > 



200 400 



4 



738.4 



600 

m/z 



968.4 



1037.7 



800 



1000 



1200 
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migrating to the same gel spot, or (iv) the theoretical and 
observed pis and molecular weights could not be reconciled. 
After these criteria were applied, the number of genes used in 
the correlation analysis was 106. 



Codon bias and predicted half-lives. Codon bias is thought 
to be an indicator of protein expression, with highly expressed 
proteins having large codon bias values. The codon bias distri- 
bution for the entire set of more than 6,000 predicted yeast 
gene ORFs is presented in Fig. 4A. The interval with the
largest frequency of genes is between the codon bias values of 
0.0 and 0.1. This segment contains more than 2,500 genes. The 
distribution of the codon bias values of the 128 different genes 
found in this study (all protein spots from Fig. 2) is shown in 
Fig. 4B, and protein half-lives (predicted from applying the 
N-end rule [33] to the experimentally determined or predicted 
protein N termini) are shown in Fig. 4C. No genes were iden- 
tified with codon bias values less than 0.1 even though thou- 
sands of genes exist in this category. In addition, nearly all of 
the proteins identified had long predicted half-lives (greater 
than 30 h). 

Correlation of mRNA and protein expression levels. The
correlation between mRNA and protein levels of the genes 
selected as described above is shown in Fig. 5. For the entire 
group (106 genes) for which a complete data set was gener- 
ated, there was a general trend of increased protein levels 
resulting from increased mRNA levels. The Pearson product 
moment correlation coefficient for the whole data set (106 
genes) was 0.935. This number is highly biased by a small 
number of genes with very large protein and message levels. A 
more representative subset of the data is shown in the inset of 
Fig. 5. It shows genes for which the message level was below 10
copies/cell and includes 69% (73 of 106 genes) of the data used 
in the study. The Pearson product moment correlation coeffi- 
cient for this data set was only 0.356. We also found that levels 
of protein expression coded for by mRNA with comparable 
abundance varied by as much as 30-fold and that the mRNA 
levels coding for proteins with comparable expression levels 
varied by as much as 20-fold. 

The distortion of the correlation value induced by the uneven
distribution of the data points along the x axis is further
demonstrated by the analysis in Fig. 6. The 106 samples in- 
cluded in the study were ranked by protein abundance, and the 
Pearson product moment correlation coefficient was repeat- 
edly calculated after including progressively more, and higher- 
abundance, proteins in each calculation. The correlation values 
remained relatively stable in the range of 0.1 to 0.4 if the 
lowest-expressed 40 to 95 proteins used in this study were 
included. However, the correlation value steadily climbed by 
the inclusion of each of the 11 very highly expressed proteins. 
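
The ranking analysis described above can be sketched as follows. This is a minimal illustration, not the authors' code: the Pearson coefficient is computed from its textbook definition, the function names are hypothetical, and the seven data points are invented solely to show how a few very abundant proteins can dominate the coefficient.

```python
import math


def pearson(x, y):
    """Pearson product moment correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def cumulative_correlation(mrna, protein, start):
    """Rank genes by protein abundance, then recompute r as more genes are included."""
    pairs = sorted(zip(protein, mrna))  # lowest protein abundance first
    results = []
    for n in range(start, len(pairs) + 1):
        subset = pairs[:n]
        r = pearson([m for _, m in subset], [p for p, _ in subset])
        results.append((n, r))
    return results


# Invented values: five low-abundance genes plus two very abundant ones.
mrna = [1, 2, 3, 4, 5, 100, 200]
protein = [50, 10, 40, 20, 30, 1000, 2000]
for n, r in cumulative_correlation(mrna, protein, start=5):
    print(n, round(r, 3))
```

In this toy data set the coefficient for the five least abundant genes is weak and negative, and it approaches 1 as soon as the two abundant genes are added, mirroring the distortion described in the text.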

Correlation of protein and mRNA expression levels with 
codon bias. Codon bias is the propensity for a gene to utilize 
the same codon to encode an amino acid even though other 
codons would insert the identical amino acid in the growing 
polypeptide sequence. It is further thought that highly ex- 
pressed proteins have large codon biases (3). To assess the 
value of codon bias for predicting mRNA and protein levels in 
exponentially growing yeast cells, we plotted the two experi- 
mental sets of data versus the codon bias (Fig. 7). The distri- 
bution patterns for both mRNA and protein levels with respect 
to codon bias were highly similar. There was high variability in 
the data within the codon bias range of 0.8 to 1.0. Although a 
large codon bias generally resulted in higher protein and mes- 
sage expression levels, codon bias did not appear to be predic- 
tive of either protein levels or mRNA levels in the cell. 

DISCUSSION 

The desired end point for the description of a biological 
system is not the analysis of mRNA transcript levels alone but 
also the accurate measurement of protein expression levels and 
their respective activities. Quantitative analysis of global 
mRNA levels currently is a preferred method for the analysis 
of the state of cells and tissues (11). Several methods which 
either provide absolute mRNA abundance (34, 35) or relative
mRNA levels in comparative analyses (20, 27) have been described
elsewhere. The techniques are fast and exquisitely sensitive
and can provide mRNA abundance for potentially any
expressed gene. Measured mRNA levels are often implicitly or
explicitly extrapolated to indicate the levels of activity of the
corresponding protein in the cell. Quantitative analysis of protein
expression levels (proteome analysis) is much more time-consuming
because proteins are analyzed sequentially one by
one and is not general because analyses are limited to the
relatively highly expressed proteins. Proteome analysis does,
however, provide types of data that are of critical importance
for the description of the state of a biological system and that
are not readily apparent from the sequence and the level of
expression of the mRNA transcript. This study attempts to
examine the relationship between mRNA and protein expression
levels for a large number of expressed genes in cells
representing the same state.

[FIG. 4 (caption, partial). ... on codon bias. No genes with codon bias values less than 0.1 were detected in this study. (C) Distribution of identified proteins in this study based on predicted half-life (estimated by N-end rule).]

Limits in the sensitivity of current protein analysis technol- 
ogy precluded a completely random sampling of yeast proteins. 
We therefore based the study on those proteins visible by silver 
staining on a 2D gel. Of the more than 1,000 visible spots, 156
were chosen to include the entire range of molecular weights, 
isoelectric focusing points, and staining intensities displayed on 
the 2D protein pattern. The genes identified in this study 
shared a number of properties. First, all of the proteins in this 
study had a codon bias of greater than 0.1 and 93% were 
greater than 0.2 (Fig. 4B). Second, with few exceptions, the 
proteins in this study had long predicted half-lives according to 
the N-end rule (Fig. 4C). Third, low-abundance proteins with 
regulatory functions such as transcription factors or protein 
kinases were not identified. 

Because the population of proteins used in this study ap- 
pears to be fairly homogeneous with respect to predicted half- 
life and codon bias, it might be expected that the correlation of 
the mRNA and protein expression levels would be stronger for 
this population than for a random sample of yeast proteins. We 
tested this assumption by evaluating the correlation value if 
different subsets of the available data were included in the 
calculation. The 106 proteins were ranked from lowest to high- 
est protein expression level, and the trend in the correlation 
value was evaluated by progressively including more of the 
higher-abundance proteins in the calculation (Fig. 6). The cor- 
relation value when only the lower-abundance 40 to 93 pro- 
teins were examined was consistently between 0.1 and 0.4. If 
the 11 most abundant proteins were included, the correlation 
steadily increased to 0.94. We therefore expect that the corre- 
lation for all yeast proteins or for a random selection would be 
less than 0.4. The observed level of correlation between 
mRNA and protein expression levels suggests the importance
of posttranslational mechanisms controlling gene expression.
Such mechanisms include translational control (15) and con- 
trol of protein half-life (33). Since these mechanisms are also 
active in higher eukaryotic cells, we speculate that there is no 
predictive correlation between steady-state levels of mRNA 
and those of protein in mammalian cells. 

Like other large-scale analyses, the present study has several 
potential sources of error related to the methods used to de- 
termine mRNA and protein expression levels. The mRNA 
levels were calculated from frequency tables of SAGE data. 
This method is highly quantitative because it is based on actual 
sequencing of unique tags from each gene, and the number of 
times that a tag is represented is proportional to the number of 
mRNA molecules for a specific gene. This method has some 
limitations including the following: (i) the magnitude of the 
error in the measurement of mRNA levels is inversely propor- 
tional to the mRNA levels, (ii) SAGE tags from highly similar 
genes may not be distinguished and therefore are summed, (iii) 
some SAGE tags are from sequences in the 3' untranslated 
region of the transcript, (iv) incomplete cleavage at the SAGE 
tag site by the restriction enzyme can result in two tags repre- 
senting one mRNA, and (v) some transcripts actually do not 
generate a SAGE tag (34, 35). 

For the SAGE method, the error associated with a value 
increases with a decreasing number of transcripts per cell. The 
conclusions drawn from this study are dependent on the quality
of the mRNA levels from previously published data (35).
Since more than 65% of the mRNA levels included in this
study were calculated to 10 copies/cell or less (40% were less
than 4 copies/cell), the error associated with these values may
be quite large. The mRNA levels were calculated from more 
than 20,000 transcripts. Assuming that the estimate of 15,000 
mRNA molecules per cell is correct (16), this would mean that 
mRNA transcripts present at only a single copy per cell would 
be detected 72% of the time (35). The mRNA levels for each 
gene were carefully scrutinized, and only mRNA levels for 
which a high degree of confidence existed were included in the 
correlation value. 
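
The detection figure quoted above can be approximated with a simple sampling model. The sketch assumes that each sequenced tag is an independent draw from the cellular mRNA pool, so that a transcript present at c copies among 15,000 mRNA molecules is missed with probability (1 - c/15,000) per tag. This model, the function name, and the round figure of 20,000 tags are assumptions for illustration. The exact tag count and calculation behind the cited 72% are not given in this section.

```python
def detection_probability(tags_sequenced, mrna_per_cell=15000, copies_per_cell=1.0):
    """Probability of observing at least one tag from a transcript.

    Assumes independent, unbiased sampling of tags from the mRNA pool.
    """
    p_single_tag = copies_per_cell / mrna_per_cell
    return 1.0 - (1.0 - p_single_tag) ** tags_sequenced


# Single-copy transcript, 15,000 mRNA molecules per cell, about 20,000 tags.
print(round(detection_probability(20000), 2))

# The same model shows how quickly detection improves with abundance.
for copies in (1, 2, 4, 10):
    print(copies, round(detection_probability(20000, copies_per_cell=copies), 3))
```

Under these assumptions a single-copy transcript is detected roughly three times out of four, which is in the same range as the 72% cited in the text.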

Protein abundance was determined by metabolic radiolabeling
with [35S]methionine. The calculation required knowledge
of three variables: the number of methionines in the mature 
protein, the radioactivity contained in the protein, and the 
specific activity of the radiolabel normalized per methionine. 
The number of methionines per protein was determined from 
the amino acid sequence of the proteins identified by tandem 
mass spectrometry. For some proteins, it was not known 
whether the methionine of the nascent polypeptide was pro- 
cessed away. The N termini of those proteins were predicted 
based on the specificity of methionine aminopeptidase (31). If
the N-terminal processing did not conform to the predicted 
specificity of processing enzymes, the calculation of the num- 
ber of methionines would be affected. This discrepancy would 
affect most the quantitation of a protein with a very low num- 
ber of methionines. The average number of calculated methi- 
onines per protein in this study was 7.2. We therefore expect 
the potential for erroneous protein quantitation due to un- 
usual N-terminal processing to be small. 
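
The methionine count used in the calculation can be sketched as below. The cleavage rule is an assumption: it uses the commonly cited specificity of methionine aminopeptidase, in which the initiator methionine is removed when the second residue has a small side chain (Gly, Ala, Ser, Thr, Cys, Pro, or Val). It may differ in detail from the rule of reference 31. The function name and the example sequences are hypothetical.

```python
# Second residues assumed to permit removal of the initiator methionine.
SMALL_SECOND_RESIDUES = set("GASTCPV")


def mature_methionine_count(sequence):
    """Count methionines after predicted removal of the initiator methionine."""
    seq = sequence.upper()
    if len(seq) >= 2 and seq[0] == "M" and seq[1] in SMALL_SECOND_RESIDUES:
        seq = seq[1:]  # initiator methionine predicted to be cleaved
    return seq.count("M")


# Hypothetical sequences: the first loses its initiator Met, the second keeps it.
print(mature_methionine_count("MSKMA"))
print(mature_methionine_count("MKLMM"))
```

A wrong prediction changes the count by one, so the relative error is largest for proteins with very few methionines, as the text notes.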



The amount of radioactivity contained in a single spot might 
be the sum of the radioactivity of comigrating proteins. Be- 
cause protein identification was based on tandem mass spec- 
trometry techniques, comigrating proteins could be identified. 
However, comigrating proteins were rarely detected in this 
study, most likely because relatively small amounts of total 
protein (40 µg) were initially loaded onto the gels, which resulted
in highly focused spots containing generally 1 to 25 ng of
protein. Because of the relatively small amount loaded, the 
concentrations of any potentially comigrating protein would 
likely be below the limit of detection of the mass spectrometry 
technique used in this study (1 to 5 ng) and below the limit of 
visualization by silver staining (1 to 5 ng). In the overwhelming 
majority of the samples analyzed, numerous peptides from a 
single protein were detected. It is assumed that any comigrat- 
ing proteins were at levels too low to be detected and that their 
influence in the calculation would be small. 

The specific activity of the radiolabel was determined by 
relating the precise amount of protein present in selected spots 
of a parallel gel, as determined by quantitative amino acid 
composition analysis, to the number of methionines present in 
the sequence of those proteins and the radioactivity deter- 
mined by liquid scintillation counting. It is possible that the 
resulting number might be influenced by unavoidable losses 
inherent in the amino acid analysis procedure applied. Because 
four different proteins were utilized in the calculation and the 
experiment was done in duplicate, the specific activity calcu- 
lated is thought to be highly accurate. Indeed, the specific 
activities calculated for each of the four proteins varied by less 
than 10%. Any inconsistencies in the calculation of the specific 
activity would result in differences in the absolute levels calcu- 
lated but not in the relative numbers and would therefore not 
influence the correlation value determined. 

The protein quantitative method used eliminates a number 
of potential errors inherent in previous methods for the quan- 
titation of proteins separated by 2DE, such as preferential 
protein staining and bias caused by inequalities in the number 
of radiolabeled residues per protein. Any 2D gel-based method 
of quantitation is complicated by the fact that in some cases the 
translation products of the same mRNA migrated to different 
spots. One major reason is posttranslational modification or 
processing of the protein. Also, artifactual proteolysis during
cell lysis and sample preparation can lead to multiple resolved 
forms of the protein. In such cases, the protein levels of spots 
coded for by the same mRNA were pooled. In addition, the 
existence of other spots coded for by the same mRNA that 
were not analyzed by mass spectrometry or that were below the 
limit of detection for silver staining cannot be ruled out. How- 
ever, since this study is based on a class of highly expressed 
proteins, the presence of undetected minor spots below silver 
staining sensitivity corresponding to a protein analyzed in the 
study would generally cause a relatively small error in protein 
quantitation. 

Codon bias is a measure of the propensity of an organism to 
selectively utilize certain codons which result in the incorpo- 
ration of the same amino acid residue in a growing polypeptide 
chain. There are 61 possible codons that code for 20 amino 
acids. The larger the codon bias value, the smaller the number 
of codons that are used to encode the protein (19). It is
thought that codon bias is a measure of protein abundance
because highly expressed proteins generally have large codon
bias values (3, 13).

Nearly all of the most highly expressed proteins had codon
bias values of greater than 0.8. However, we detected a number
of genes with high codon bias and relatively low protein abundance
(Fig. 7). For example, the expressed gene with both the
second largest protein and mRNA levels in the study was
ENO2_YEAST (775,000 and 289.1 copies/cell, respectively).
ENO1_YEAST was also present in the gel at much lower
protein and mRNA levels (44,200 and 0.7 copies/cell, respectively).
The codon bias values for ENO2 and ENO1 are similar
(0.96 and 0.93, respectively), but the expression of the two
genes is differentially regulated. Specifically, ENO1_YEAST is
glucose repressed (6) and was therefore present in low abundance
under the conditions used. Other genes with large codon
bias values that were not of high protein abundance in the gel
include EFT1, TIF1, HXK2, GSP1, EGD2, SHM2, and TAL1.
We conclude that merely determining the codon bias of a gene
is not sufficient to predict its protein expression level.

Interestingly, codon bias appears to be an excellent indicator 
of the boundaries of current 2D gel proteome analysis tech- 
nology. There are thousands of genes with expressed mRNA 
and likely expressed protein with codon bias values less than 
0.1 (Fig. 4A). In this study, we detected none of them, and only
a very small percentage of the genes detected in this study had 
codon bias values between 0.1 and 0.2 (Fig. 4B). Indeed, in 
every examined yeast proteome study (5, 7, 13, 28) where the 
combined total number of identified proteins is 300 to 400, this 
same observation is true. It is expected that for the more 
complex cells of higher eukaryotic organisms the detection of 
low-abundance proteins would be even more challenging than
for yeast. This indicates that highly abundant, long-lived pro- 
teins are overwhelmingly detected in proteome studies. If pro- 
teome analysis is to provide truly meaningful information 
about cellular processes, it must be able to penetrate to the 
level of regulatory proteins, including transcription factors and 
protein kinases. A promising approach is the use of narrow- 
range focusing gels with immobilized pH gradients (IPG) (23). 
This would allow for the loading of significantly more protein 
per pH unit covered and also provide increased resolution of 
proteins with similar electrophoretic mobilities. A standard pH 
gradient in an isoelectric focusing gel covers a 7-pH-unit range 
(pH 3 to 10) over 18 cm. A narrow-range focusing gel might
expand the range to 0.5 pH units over 18 cm or more. This 
could potentially increase by more than 10-fold the number of 
proteins that can be detected. Clearly, current proteome tech- 
nology is incapable of analyzing low-abundance regulatory pro- 
teins without employing an enrichment method for relatively 
low-abundance proteins. In conclusion, this study examined 
the relationship between yeast protein and message levels and 
revealed that transcript levels provide little predictive value 
with respect to the extent of protein expression.